Yg4Arxiv
Computer Vision and Pattern Recognition 220
☆ EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
comment: Project page: https://catherine-r-he.github.io/EntityBench/
☆ ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
comment: Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS
☆ RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
☆ VGGT-$Ω$ CVPR 2026
Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/
comment: CVPR 2026 (Oral)
☆ Aligning Latent Geometry for Spherical Flow Matching in Image Generation
Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
☆ RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
comment: Project Page: https://yanzuo.lu/raven
☆ Articraft: An Agentic System for Scalable Articulated 3D Asset Generation
A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.
comment: Project page: https://articraft3d.github.io/
☆ VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
☆ Quantitative Video World Model Evaluation for Geometric-Consistency
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.
comment: 12 pages, 5 figures. Project page : https://pdi-bench.github.io/
☆ Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
comment: Project page: https://yyfz.github.io/warp-as-history/
☆ From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
Modern image editing models produce realistic results but struggle with abstract, multi step instructions (e.g., ``make this advertisement more vegetarian-friendly''). Prior agent based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multistep baselines.
☆ SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WMdemonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.
comment: https://nvlabs.github.io/Sana/WM/
☆ Evidential Reasoning Advances Interpretable Real-World Disease Screening ICML 2026
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
comment: ICML 2026
☆ Does Synthetic Layered Design Data Benefit Layered Design Decomposition?
Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
comment: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers
☆ Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1--2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose \textbf{Causal Forcing++}, a principled and scalable pipeline that uses \emph{causal consistency distillation} (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, \ours, surpasses the SOTA 4-step chunk-wise Causal Forcing under the \textit{\textbf{frame-wise 2-step setting}} by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50\% and Stage 2 training cost by $\sim$$4\times$. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
☆ MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
comment: 46 pages, 15 figures
☆ CLOVER: Closed-Loop Value Estimation \& Ranking for End-to-End Autonomous Driving Planning
End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training--evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator--scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code data will be released at https://github.com/WilliamXuanYu/CLOVER.
☆ DriveCtrl: Conditioned Sim-to-Real Driving Video Generation
Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.
☆ CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites
The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive \emph{Porites} sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated μCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled μCT virtual slabs of \emph{Porites} sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 μCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
comment: 15 pages, 10 figures, 2 tables
☆ SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection
We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision; then an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, employing positive-only message passing where high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
comment: 5 pages, 4 figures
☆ On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
comment: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
☆ Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection
Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Thus, here we test whether a Monte Carlo-inspired analytic model can compute hemoglobin from RGB signal built upon extracted classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 -> 0.916, but the cross-seed mean is 0.646 -> 0.608 with sigma_PI = 0.23 - reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.
comment: 24 pages, 6 figures, 3 tables. Code and trained-model checkpoints at https://github.com/integritynoble/GI_Multi_Task . 6-seed (seeds 41, 42, 43, 44, 45, 47) mean +/- SD ablation as the headline; per-class single-seed=42 analyses in Appendix A
☆ DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
☆ LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines, which perform segment-level inference independently owing to token constraints and reason without structured temporal context, allowing VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. To specify, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.
☆ EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration
We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.
comment: Project Page: https://everanimate.github.io/homepage/
☆ HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning
Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic understanding.To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperfoms previous methods, achieving a significant improvement of +7.52\% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at https://github.com/Man-Wang-star/HiSem
☆ 3D Skew-Normal Splatting
3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions, yet they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moremover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines.Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
☆ Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning
Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responder, 133 non-responder). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
☆ Characterizing the visual representation of objects from the child's view
Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset ($N$ = 31 participants, 868 hours, ages 5--36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.
comment: 19 pages, 6 figures
☆ Compositional Video Generation via Inference-Time Guidance
Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose \textbf{CVG}, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
☆ Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image ICLR 2026
Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.
comment: ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/
☆ MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions
Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to various biomedical research. Yet conducting this manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations of biological object types, sample processing protocols, imaging equipment, and analysis tasks, etc.) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which, however, can impose burdens that are often practically unsustainable for laboratories, forcing biomedical researchers to still commonly rely on manual analysis, thereby severely bottlenecking the pace of biomedical research progress. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem, effectively handling this problem by exploiting the robust matching capability from pre-trained latent diffusion models.
☆ MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.
comment: 19 pages, 17 figures
☆ H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors
Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions.To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches.Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. Both the model and the dataset will be open-sourced.
comment: 8 pages, 9 figures
☆ Meschers: Geometry Processing of Impossible Objects
Impossible objects, geometric constructions that humans can perceive but that cannot exist in real life, have been a topic of intrigue in visual arts, perception, and graphics, yet no satisfying computer representation of such objects exists. Previous work embeds impossible objects in 3D, cutting them or twisting/bending them in the depth axis. Cutting an impossible object changes its local geometry at the cut, which can hamper downstream graphics applications, such as smoothing, while bending makes it difficult to relight the object. Both of these can invalidate geometry operations, such as distance computation. As an alternative, we introduce Meschers, meshes capable of representing impossible constructions akin to those found in M.C. Escher's woodcuts. Our representation has a theoretical foundation in discrete exterior calculus and supports the use-cases above, as we demonstrate in a number of example applications. Moreover, because we can do discrete geometry processing on our representation, we can inverse-render impossible objects. We also compare our representation to cut and bend representations of impossible objects.
☆ Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
☆ A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction
Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
comment: 13 pages, 5 figures, 2 tables, 20 equations, 3 appendices
☆ ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing
State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.
☆ Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
☆ Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control
We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and $10 \times$ faster inference speed on HumanML3D.
☆ SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation ICML2026
Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.
comment: Accept by ICML2026
☆ Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse
Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
comment: 18 pages, 4 figures
☆ SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
comment: Preprint. Code, models, and dataset are provided in the manuscript
☆ Representative Attention For Vision Transformers
Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. Via replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input.RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
☆ SteerSeg: Attention Steering for Reasoning Video Segmentation
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
comment: Project page: https://steerseg.github.io
☆ MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
comment: Work in progress
☆ SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer
Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.
comment: Project page:http://zheng222.github.io/SEDiT_project
☆ Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
☆ Hierarchical Image Tokenization for Multi-Scale Image Super Resolution ICML 2026
We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a \textbf{Hierarchical Image Tokenization (HIT)} approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a \textbf{Direct Preference Optimization (DPO) regularization term} that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
comment: Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with arXiv:2506.04990
☆ SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition
Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
comment: 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba
☆ Masked Next-Scale Prediction for Self-supervised Scene Text Recognition CVPR
Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2\% average accuracy on the challenging Union14M benchmark and 96.7\% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP
comment: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track.10 pages, 4 figures
☆ Denoising-GS: Gaussian Splatting with Spatial-aware Denoising
Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.
☆ HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
comment: 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables
☆ Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
☆ LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover
Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.
☆ FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery
Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.
☆ SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation
Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact--some are barely noticeable, while others are highly disturbing--yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at https://huggingface.co/datasets/imolodetskikh/sr-artifact-prominence.
☆ Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
comment: Accepted at the 14th International Workshop on Biometrics and Forensics
☆ MechVerse: Evaluating Physical Motion Consistency in Video Generation Models
Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, We introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.
comment: Under Review
☆ Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis
Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.
☆ Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval
This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
comment: 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics
☆ Learning Direct Control Policies with Flow Matching for Autonomous Driving IEEE
We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird's-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of Ordinary Differential Equations (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at https://marcelloceresini.github.io/DirectControlFlowMatching.
comment: 16 pages, 6 figures, 2 tables. Accepted at IEEE ITSC 2026
☆ HDRFace: Rethinking Face Restoration with High-Dimensional Representation
Face restoration under complex degradations still remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.
☆ The Velocity Deficit: Initial Energy Injection for Flow Matching ICML2026
While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold-a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
comment: Accepted by ICML2026
☆ Probing into Camera Control of Video Models
Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
☆ SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track CVPR 2026
Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.
comment: Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track
☆ Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation
In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.
☆ COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking
Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.
☆ Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
☆ Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning
Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal$\unicode{x2013}$replaying a subset of past samples$\unicode{x2013}$is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient$\unicode{x2013}$capturing self-induced interference$\unicode{x2013}$emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.
comment: 37 pages; 24 tables; 7 figures; submitted to a journal
☆ MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection
Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at https://github.com/bigggs/MonoPRIO.
comment: 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at https://github.com/bigggs/MonoPRIO
☆ BioHuman: Learning Biomechanical Human Representations from Video
Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.
☆ Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining ICML 2026
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
comment: Accepted at ICML 2026
☆ EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding ICML 2026
Understanding human--environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
comment: Accepted at ICML 2026. Project page: https://github.com/yuggiehk/EARL
☆ Video-Zero: Self-Evolution Video Understanding
Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner--Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.
☆ UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.
☆ CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators
Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
☆ Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning ICPR
Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.
comment: Accepted in 28th International Conference on Pattern Recognition (ICPR) 2026
☆ AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Contro
Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.
☆ IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines
comment: Code can be found in https://github.com/ZGC-EmbodyAI/IntentVLA
☆ Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke
Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities; To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.
comment: Corresponding author: Ting Xiao
☆ Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners ICML 2026
Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image task (X2I): the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we construct a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity among simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.
comment: Accepted by ICML 2026
☆ StyleTextGen: Style-Conditioned Multilingual Scene Text Generation CVPR 2026
Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.
comment: This paper has been accepted to CVPR 2026
☆ Towards Continuous Sign Language Conversation from Isolated Signs
Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.
☆ SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
☆ Generating HDR Video from SDR Video
The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: sdr2hdrvideo.github.io
☆ EponaV2: Driving World Model with Comprehensive Future Reasoning
Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is solely built on next frame image forecasting. Due to the lack of enough supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performances of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3PDMS, +5.5EPDMS) demonstrate the effectiveness of our methods.
☆ Are Candidate Models Really Needed for Active Learning?
Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly in training Convolutional Neural Networks (CNNs) and transformers due to a larger number of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, the current active learning frameworks are time-intensive which select the samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, mostly LC demonstrated the best performance in our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
comment: Accepted for publication in Computer Vision and Image Understanding (CVIU)
☆ MiVE: Multiscale Vision-language features for reference-guided video Editing ICML 2026
Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.
comment: ICML 2026
☆ Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging ICML2026
Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.
comment: ICML2026
☆ TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection
Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset's effectiveness for both vegetation Multi-class Change Detection as well as Semantic Change Detection. The proposed dataset and methods are available at https://github.com/omkarsoak/TERRA-CD.
comment: Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London
☆ Vision-Based Water Level and Flow Estimation
With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git
☆ How to Evaluate and Refine your CAM ICPR 2026
Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.
comment: Accepted at ICPR 2026
☆ MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
☆ Action-Inspired Generative Models
We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_φ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_φ$'s guiding signal. Crucially, $V_φ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_φ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.
comment: 11 pages, 5 figures, and 4 tables
☆ Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors
3D Gaussian Splatting (3DGS) has emerged as a prominent framework for real-time, photorealistic scene reconstruction, offering significant speed-ups over Neural Radiance Fields (NeRF). However, the fidelity of 3DGS representations remains heavily dependent on the quality of the initial point cloud. While standard Structure-from-Motion (SfM) pipelines using COLMAP provide adequate initialisation, they often suffer from high computational costs and sparsity in textureless regions, which degrades subsequent reconstruction accuracy and convergence speed. In this work, we introduce an AV1-based feature detection and matching pipeline that significantly reduces SfM processing overhead. By leveraging motion vectors inherent to the AV1 video codec, we bypass computationally expensive exhaustive matching while maintaining geometric robustness. Our pipeline produces substantially denser point clouds, with up to eight times as many points as classical SfM. We demonstrate that this enhanced initialisation directly improves 3DGS performance, yielding an 9-point increase in VMAF and a 63% average reduction in training time required to reach baseline quality. The project page: https://sigmedia.tv/AV1-3DGS.github.io/
☆ UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation
RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.
☆ Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
☆ CalibAnyView: Beyond Single-View Camera Calibration in the Wild
Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ($N \geq 1$) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.
comment: 44 pages, 25 figures
☆ Deep Image Segmentation via Discriminant Feature Learning ICIP 2026
Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
comment: Accepted to ICIP 2026
☆ ViMU: Benchmarking Video Metaphorical Understanding
Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it-the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.
MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting
Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba's linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba's sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.
comment: 9 pages,7 figures
☆ Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach ICME 2026
Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.
comment: Current has been accepted by ICME 2026
☆ VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.
comment: 5 pages, 2 figures
☆ TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation
High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single image conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we proposed a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multi-model large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input pointclouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE's structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.
comment: Technical Report
☆ FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology
Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments--skewness and kurtosis--as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
☆ A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval ACL 2026
Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
comment: Accepted to Findings of ACL 2026
☆ Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation
Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine grained multi scale decoding. To address these issues, we propose Med DisSeg, a dispersion driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine grained structure segmentation. The Dispersive Loss enlarges inter sample margins by treating in batch hidden representations as negative pairs, producing well dispersed and boundary aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure sensitive responses, while the decoder performs adaptive multi scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state of the art performance. Moreover, Med DisSeg achieves competitive results on multi organ CT segmentation, supporting its robustness and cross task applicability.
☆ Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction CVPR 2026
Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant "memories" from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.
comment: Accepted to CVPR 2026
☆ SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation
Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.
☆ LiWi: Layering in the Wild
Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundary. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and degradation-restoration objective provides boundary-correction supervision by recovering clean foreground image from degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.
☆ Local Spatiotemporal Convolutional Network for Robust Gait Recognition
Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly rely on either static appearance features extracted from individual silhouette frames or employ complex sequential models (\eg, LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
☆ PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench are publicly available at: https://github.com/xiaomi-research/prove/.
comment: Project Page: https://xiaomi-research.github.io/prove/
☆ Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
☆ From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper
In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. We propose a novel 3D human pose estimation input method: the sparse interleaved input to address this. This method leverages images captured from different camera views at various time points (e.g., View 1 at time $t$ and View 2 at time $t+δ$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages: First, it can theoretically increase the output pose frame rate by N times with N cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the production. Second, using a sparse subset of available frames, our method can reduce data redundancy and simultaneously achieve better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026
☆ ArcGate: Adaptive Arctangent Gated Activation
Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.
☆ HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
☆ Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.
☆ Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Codes are released at: https://github.com/liyih/SEF_AIGC_detection.
comment: preprint
☆ GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista
☆ Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos
Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/
☆ ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models
Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.
comment: 5 pages, 4 figures. Open-source software paper
☆ Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.
comment: 30 pages, preprint
☆ Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose \textbf{DAJI} (\emph{Dynamics-Aligned Joint Intent}), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42\% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
☆ GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
☆ DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making MICCAI2026
Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.
comment: MICCAI2026 early acceptance
☆ SceneForge: Structured World Supervision from 3D Interventions
Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.
☆ Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g. shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth with the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), while AdvPatch only 0--9%. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.
☆ Analogical Trajectory Transfer
We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, which indicates the broad applicability of our work to AR/VR and robotics.
☆ Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression
Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.
☆ Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppress unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
☆ Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation
Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at https://github.com/yuanhang-yao/semantic-prior.
☆ AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors
Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.
☆ IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model
In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an integration of an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various degradation in low-light scenarios.
comment: Accepted by CGI-2025
☆ InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
comment: Code and checkpoints are available at https://github.com/LeapLabTHU/InsightTok
☆ D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog
Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.
☆ TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention
Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability and efficiency. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: https://turbovggt.github.io/.
comment: Technical Report
☆ CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle, than token-wise pruning, for memory-constrained streaming video understanding.
☆ ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
☆ To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
☆ KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
☆ CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.
☆ Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers ICML 2026
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic `omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.
comment: Accepted to ICML 2026
☆ PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.
comment: First two authors contributed equally, website: https://phy-motion.github.io/
☆ Image Restoration via Diffusion Models with Dynamic Resolution ICML 2026
Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.
comment: Accepted by ICML 2026
☆ Architecture-Aware Explanation Auditing for Industrial Visual Inspection
Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model's native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen's d) > 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
☆ Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy
Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.
comment: Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery
☆ Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images
Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.
☆ Implicit spatial-frequency fusion of hyperspectral and lidar data via kolmogorov-arnold networks
Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.
comment: 6 pages, 1 figure, conference
☆ Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI IEEE
Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard--Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.
comment: 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)
♻ ☆ MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense & sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steers the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.
comment: Work in progress. Project page: https://mind-omni.github.io/
♻ ☆ Action Emergence from Streaming Intent
We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.
comment: Project page: https://mind-omni.github.io/
♻ ☆ Driving Intents Amplify Planning-Oriented Reinforcement Learning
Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.
comment: Project page: https://mind-omni.github.io/
♻ ☆ Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision
To humans, a robin seems more like a bird than a bird seems like a robin, but does this asymmetry also hold for machine vision? Humans and modern vision models can match each other in accuracy while making systematically different kinds of errors, differing not in how often they fail, but in who gets mistaken for whom. We show these directional confusions reveal distinct inductive biases invisible to accuracy alone. Using matched human and deep neural network responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link its organization to the geometry of the information--error trade-off - how efficiently, and how gracefully, a system generalizes under distortion. We find that humans exhibit broad but weak asymmetries across many class pairs, whereas deep vision models show sparser, stronger directional collapses into a few dominant categories. Robustness training reduces overall asymmetry magnitude but fails to recover this human-like distributed structure. Generative simulations further show that these two asymmetry organizations shift the trade-off geometry in opposite directions even at matched accuracy, explaining why the same scalar asymmetry score can reflect fundamentally different generalization strategies. Together, these results establish directional confusion structure as a sensitive, interpretable signature of inductive bias that accuracy-based evaluation cannot recover.
♻ ☆ The Potential of Convolutional Neural Networks for Cancer Detection
Early detection is crucial for successful cancer treatment and increasing survivability rates, particularly in the most common forms. Ten different cancers have been identified in most of these advances that effectively use CNNs (Convolutional Neural Networks) for classification. The distinct architectures of CNNs used in each study concentrate on pattern recognition for different types of cancer across various datasets. The advantages and disadvantages of each approach are identified by comparing these architectures. This study explores the potential of integrating CNNs into clinical practice to complement traditional diagnostic methods. It also identifies the top-performing CNN architectures, highlighting their role in enhancing diagnostic capabilities in healthcare.
♻ ☆ SyncLight: Single-Edit Multi-View Relighting
We present SyncLight, a method to enable consistent, parametric control over light sources across multiple uncalibrated views of a static scene conditioned on a single view. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments -- curated from existing sources and newly designed scenes -- alongside high-fidelity, real-world multi-view captures under calibrated illumination. Though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.
comment: Project page: http://sync-light.github.io
♻ ☆ Do-Undo Bench: Reversibility for Action Understanding in Image Generation
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
comment: Project page: https://s-mahajan.github.io/Do-Undo-Bench/
♻ ☆ AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting CVPR
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or inject adapters into the image encoder for audio-visual fusion. Yet both directions fall short in human-in-the-loop scenarios due to limited prompt accuracy and increased inference overhead. In particular, these adapter-based methods often suffer from audio prompt dilution, where the signal gradually weakens as it propagates through the network. In this work, we propose AuralSAM2, which integrates audio into SAM2 while largely preserving its promptable segmentation capability. Its core module, AuralFuser, fuses audio and visual features to generate sparse and dense prompts. Guided by audio and built upon SAM2's feature pyramid, these prompts propagate auditory cues across visual layers, reinforcing cross-modal influence. To further align modalities, we introduce an audio-guided contrastive loss that emphasises auditory relevance in dominant visual features. Our method achieves notable accuracy gains on public benchmarks with only minimal impact on the interactive efficiency of promptable segmentation. Our code is available at https://github.com/yyliu01/AuralSAM2.
comment: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026
♻ ☆ Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation
Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
♻ ☆ SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model internal signals like logits or hidden representations, which are not available for frontier closed-sourced models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES .
♻ ☆ Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers
We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distilled a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
♻ ☆ Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
comment: GitHub link: https://github.com/Tencent/Hunyuan3D-2
♻ ☆ Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.
comment: 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper
♻ ☆ Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds ICML 2026
Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA's advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.
comment: ICML 2026
♻ ☆ Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
♻ ☆ Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation CVPR 2026
Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
comment: Accepted to CVPR 2026
♻ ☆ Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images show object duplication artifacts consistently. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient and achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows the applicability of our method to be general, where we interpolate fully-connected layers, going beyond convolution layers. We also discuss how we can reduce the memory footprints of training neural networks, using our method up to at least $4\times$.
comment: Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3
♻ ☆ AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to https://automot-website.github.io/ for the demonstration videos and qualitative results.
♻ ☆ Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection ICML 2026
Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.
comment: Accepted by ICML 2026
♻ ☆ PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models
We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inferencing cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.
comment: Accepted by Computational Visual Media Journal (CVMJ) in Feb. 2026. 19 pages, 7 figures
♻ ☆ PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
♻ ☆ PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation ICLR 2026
Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction and point cloud reconstruction - all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask - suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Necessary code and additional demos are available at Link: https://page4d.github.io/. Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.
comment: ICLR 2026, VGGT-4D, Dynamic VGGT
♻ ☆ FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching
Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space. Our code and models are released on https://csu-jpg.github.io/FlowInOne.github.io/
♻ ☆ Beyond Nearest Neighbor Interpolation in Data Augmentation
Avoiding the risk of undefined categorical labels using nearest neighbor interpolation overlooks the risk of exacerbating pixel level annotation errors in augmented training data. Additionally, the inherent low pass filtering effects of interpolation algorithms exacerbate the risk of degrading high frequency structural details within annotated regions of interest. To avoid these risks, the author modified convolutional neural networks data transformation functions by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation specific augmented training data, enabling quantitative assessment of interpolation specific low pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ datasets demonstrated performance gains across multiple quantitative metrics.
comment: 10 pages, 11 figures, 14 tables
♻ ☆ WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition CVPR26
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16\% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
comment: Accepted by CVPR26, codes and weights are publicly available
♻ ☆ JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3\%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
♻ ☆ Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow-OPD .
comment: Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD
♻ ☆ Rethinking Event-Based Object Dtection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning
Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7$\times$ faster), 1Mpx/Gen4 (+0.5 mAP and 1.6$\times$ faster), and eTraM (+3.0 mAP and 2.0$\times$ faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.
♻ ☆ From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
♻ ☆ OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
♻ ☆ SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.
♻ ☆ DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search
Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.
♻ ☆ InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery
Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.
comment: 13 pages, 10 figures
♻ ☆ Motion-Aware Caching for Efficient Autoregressive Video Generation
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.
comment: 20 pages
♻ ☆ LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space.We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.
♻ ☆ DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving
Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions.Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
comment: 17 pages, 10 figures
♻ ☆ Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
♻ ☆ RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis NeurIPS 2025
Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 1048 wrist conventional radiographs of 388 patients from six medical centers, with pixel-level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.
comment: Published in NeurIPS 2025
♻ ☆ REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1)long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.
♻ ☆ HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remain compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
comment: 12 pages in total
♻ ☆ CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning ICML 2026
Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high quality samples are curated as training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.
comment: Accepted by ICML 2026
♻ ☆ Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
♻ ☆ RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.
comment: arXiv admin note: text overlap with arXiv:2504.11739
♻ ☆ Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning ICLR 2026
Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.
comment: 27 pages, 11 figures, 5 tables, accepted by ICLR 2026
♻ ☆ M$^2$E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection
Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. Unlike static- or ground-observer event-based UAV detection, onboard UAV-view detection breaks the clean-background assumption because sensor ego-motion can activate dense background events over the entire field of view. To explore this practical problem, we present M$^2$E-UAV, to the best of our knowledge, the first onboard UAV-view motion-on-motion event-based dataset and benchmark for tiny UAV detection, where both the sensing platform and the target UAV are moving. M$^2$E-UAV provides synchronized event streams and IMU measurements collected from an onboard sensing platform, together with event-level UAV foreground labels derived from temporally propagated 10 Hz bounding-box annotations. The processed benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We define a train/validation split and an evaluation protocol for comparing representative existing baselines across event-frame, voxel-grid, and point-set representations, with optional IMU input. The benchmark results show that existing baselines remain limited under sparse tiny-target evidence and dense ego-motion-induced background events. Code and benchmark files will be released at https://github.com/Wickyan/M2E-UAV.
♻ ☆ TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle.The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1\% of visual tokens but maintains 97.2\% of the original performance, with a 2.75$\times$ prefill speedup, 2.14$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead.Our code is available at https://github.com/ocy1/TRIO.
♻ ☆ GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic
While physically-based rendering (PBR) simulates light transport that guarantees physical realism, achieving true photorealistic rendering (PRR) demands prohibitive time and labor, and still struggles to capture the intractable richness of the real world. We propose GeRM, the first multimodal generative rendering model to bridge the gap from PBR to PRR (P2P). We formulate this P2P transition by learning a distribution transfer vector (DTV) field to direct the generative process. To achieve this, we introduce a multi-condition ControlNet that synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions. To improve the model's grasp of the image distribution shift driven by text prompts, we propose a residual perceptual transfer mechanism to associate text prompts with corresponding targeted modification regions, which more clearly defines the incremental component updates. To supervise this transfer process, we introduce a multi-agent visual language model framework to construct an expert-guided pairwise transfer dataset, named P2P-50K, where each paired sample corresponds to a specific transfer vector in the DTV field. Extensive experiments demonstrate that GeRM synthesizes high-quality controllable images and outperforms state-of-the-art baselines across diverse applications, including PBR and PRR image synthesis and editing.
♻ ☆ VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.
♻ ☆ MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
comment: Project page: https://animotionlab.github.io/MoCapAnythingV2/
♻ ☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
♻ ☆ SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples
Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
comment: 42 pages, 16 figures, 11 tables, Under Review, Code: https://github.com/DrenFazlija/Scooter, Data: https://doi.org/10.5281/zenodo.15771501
♻ ☆ MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .
comment: Some fundemental change in text and codebase. Will request a new submission later on
♻ ☆ CC-Pan: Channel-wise Compression based Diffusion for Efficient Pan-Sharpening
Recently, diffusion models have brought novel insights to pan-sharpening and notably boosted fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) sensors, suffering from high inference latency and sensor-specific limitations. In this paper, we present CC-Pan, a cross-sensor latent diffusion framework for efficient pan-sharpening. Specifically, CC-Pan trains a band-wise single-channel variational autoencoder (VAE) to encode high-resolution multispectral (HRMS) images into compact latent representations, naturally supporting MS images with varying band counts across different sensors and establishing a basis for inference acceleration. Spectral physical properties, along with PAN and MS images, are then injected into the diffusion backbone through carefully designed unidirectional and bidirectional interactive control structures, achieving high-precision spatial--spectral fusion in the latent diffusion process. Furthermore, a lightweight region-based cross-band attention (RCBA) module is incorporated at the central layer of the diffusion model, reinforcing inter-band spectral connections to boost spectral consistency and further elevate fusion precision. Extensive experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that CC-Pan outperforms state-of-the-art diffusion-based methods across all three benchmarks, attains a $2$--$3\times$ inference speedup, and exhibits robust cross-sensor generalization capability on the held-out WorldView-2 sensor without any sensor-specific retraining.
♻ ☆ R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow SIGGRAPH 2026
Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are ``rectified'' to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.
comment: Accepted by SIGGRAPH 2026, Project Page: https://r-dmesh.github.io/ Code URL: https://github.com/Tencent-Hunyuan/R-DMesh
♻ ☆ DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29\%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
comment: 50pages, 25 figures, 14 tables;
♻ ☆ LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning AAAI 2026
Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
comment: AAAI 2026 Oral Presentation. 9 pages
♻ ☆ SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification IEEE
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative models to maintain the identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust model is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's ability to represent persons. To address these issues, we propose a novel generative framework named SD-ReID for AG-ReID, which leverages generative models to mimic the feature distribution of different views while extracting robust identity representations. More specifically, we first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions. Furthermore, we introduce the View-Refined Decoder (VRD) to bridge the gap between instance-level and global-level features. Finally, both person representations and all-view features are employed to retrieve target persons. Extensive experiments on five AG-ReID benchmarks (i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the effectiveness of our proposed method. The source code and pre-trained models are available at https://github.com/924973292/SD-ReID.
comment: This work is accepted by IEEE TIP 2026. More modifications may performed
♻ ☆ Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
comment: Due to issues with the training epochs and training strategy in our paper, there are numerical errors in the result comparison table presented in the preprint. Therefore, we have decided to withdraw the manuscript for further revision
♻ ☆ The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma
Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves modality-specific radiomic structure while enabling late fusion in a compact probabilistic latent space. The approach is evaluated on radiomic features extracted from the necrotic tumor core in post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Re-covery (FLAIR) Magnetic Resonance Imaging (MRI). Experimental results demonstrate that the proposed multi-view VAE combined with a random forest classifier achieves a test Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.77 (95% confidence interval: 0.71-0.83), substantially outperforming both a baseline radiomics model (AUC = 0.54) and a hyperparameter-tuned model (AUC = 0.64). These findings indicate that multi-view probabilistic encoding enables more effective integration of complementary MRI information and significantly improves predictive performance for MGMT promoter methylation status.
comment: 17 pages, 4 figures
♻ ☆ VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
♻ ☆ Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models IEEE
We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) datasets that focus solely on classification tasks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding of hyperspectral imagery. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field.
comment: Accepted for publication in IEEE Geoscience and Remote Sensing Magazine (GRSM), 2026
♻ ☆ Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization
Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
comment: Withdrawn by the authors due to accidental submissions of non-final manuscript versions. Both v1 and v2 contain an outdated framework figure, in which several module names are inconsistent with the finalized terminology used in the manuscript. This inconsistency may confuse readers about the structure and naming of the proposed method
♻ ☆ Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
♻ ☆ Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals
Cross-domain few-shot segmentation (CD-FSS) aims to segment unseen categories with very limited samples while alleviating the negative effects of domain shift between the source and target domains. At present, existing CD-FSS studies typically rely on multiple independent modules to enhance cross-domain adaptability. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations (ODEs) and the Fourier transform, resulting in a structurally concise method-Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs not only explores a domain-agnostic feature space, but also achieves significant performance improvement through target-domain fine-tuning with extremely limited support samples. Specifically, the ODE modeling process incorporates nonlinear transformations and random perturbations of the amplitude and phase spectra, effectively simulating potential target-domain data distributions. Meanwhile, the analytical solution of the ODE is transformed into a theoretically infinitely iterable feature refinement process, thereby enhancing the learning capability under limited support samples. In this way, both the exploration of domain-agnostic features and the few-shot learning problem can be addressed through the optimization of the intrinsic parameters of the ODE. Moreover, during target-domain fine-tuning, we strictly constrain the support samples to match the settings of real-world CD-FSS tasks, without incurring additional annotation costs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.
♻ ☆ Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation CVPR 2026
Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
comment: Accepted by CVPR 2026
♻ ☆ Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport ICML 2026
The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.
comment: Accepted to ICML 2026
♻ ☆ IPR-1: Interactive Physical Reasoner
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.
comment: 13 pages of main text and 20 pages of appendices. Project page: https://mybearyzhang.github.io/ipr-1
♻ ☆ RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model
Semantic-rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing training efficiency for Diffusion Transformers (DiTs). In this paper, we propose Repack then Refine, a three-stage framework that brings the semantic-rich VFM features to DiT while further accelerating learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold. This filters out the redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling on this highly compressed latent space. Finally, to restore the high-frequency details lost due to the compression in RePack, we propose a Latent-Guided Refiner, which is trained lastly for enhancing the image details. On ImageNet-1K, RePack-DiT-XL/1 achieves an FID of 1.82 in only 64 training epochs. With the Refiner module, performance further improves to an FID of 1.65, significantly surpassing latest LDMs in terms of convergence efficiency. Our results demonstrate that packing VFM features, followed by targeted refinement, is a highly effective strategy for balancing generative fidelity with training efficiency. Source code is publicly available at https://github.com/guanfangdong/RePack-then-Refine.
♻ ☆ G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline
We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.
♻ ☆ GenExam: A Multidisciplinary Text-to-Image Exam ICML 2026
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights for on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
comment: Accepted by ICML 2026
Artificial Intelligence 300
☆ EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
comment: Project page: https://catherine-r-he.github.io/EntityBench/
☆ ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
comment: Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS
☆ FutureSim: Replaying World Events to Evaluate Adaptive Agents
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
comment: 31 pages, 10 main
☆ VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
☆ Quantitative Video World Model Evaluation for Geometric-Consistency
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.
comment: 12 pages, 5 figures. Project page : https://pdi-bench.github.io/
☆ Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing
Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
comment: 5 pages, 4 figures
☆ OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation
Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
comment: 19 pages, 4 figures
☆ Evidential Reasoning Advances Interpretable Real-World Disease Screening ICML 2026
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
comment: ICML 2026
☆ Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
comment: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)
☆ Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.
☆ MeMo: Memory as a Model
Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.
comment: This paper introduces MeMo, a framework that augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise
☆ Self-Distilled Agentic Reinforcement Learning
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment for negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL--OPSD baselines across model scales.
☆ Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.
☆ Widening the Gap: Exploiting LLM Quantization via Outlier Injection
LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users. However, existing quantization-conditioned attacks have been limited to relatively simple quantization methods, where the attacker can estimate weight regions that remain invariant under the target quantization. Notably, prior attacks have consistently failed to compromise more popular and sophisticated schemes, limiting their practical impact. In this work, we introduce the first quantization-conditioned attack that consistently induces malicious behavior that can be triggered by a broad range of advanced quantization techniques, including AWQ, GPTQ, and GGUF I-quants. Our attack exploits a simple property shared by many modern quantization methods: large outliers can cause other weights to be rounded to zero. Consequently, by injecting outliers into specific weight blocks, an adversary can therefore induce a targeted, predictable weight collapse in the model. This effect can be used to craft seemingly benign full-precision models that exhibit a wide range of malicious behaviors after quantization. Through extensive evaluation across three attack scenarios and LLMs, we show that our attack achieves high success rates against a broad range of quantization methods on which prior attacks fail. Our results demonstrate, for the first time, that the security risks of quantization are not restricted to simpler schemes but are broadly relevant across complex, widely-used quantization methods.
☆ APWA: A Distributed Architecture for Parallelizable Agentic Workflows
Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.
comment: 25 pages, 2 figures, 14 tables
☆ Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation
Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely fragmented. While conversational AI has emerged as a tool used by many (e.g., generative AI chatbots like ChatGPT and Google Gemini), we do not have a clear understanding of how international students adopt and perceive these technologies as support tools. We conducted a survey study (n=60) to map the relationship between international students' challenges and AI adoption patterns, followed by an interview study with 14 participants to identify the underlying motivations and boundaries of use. Our findings show that AI is perceived as a first-aid tool for immediate challenges, however, there is an interest in transforming AI from a tool for short-term help into a long-term support companion. By identifying where and how AI can provide long-term support, and where it is insufficient, we contribute recommendations for creating AI-powered support tailored to the unique needs of international students.
comment: 33 pages, single column. 4 figures, 9 tables
☆ CLOVER: Closed-Loop Value Estimation \& Ranking for End-to-End Autonomous Driving Planning
End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training--evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator--scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code data will be released at https://github.com/WilliamXuanYu/CLOVER.
☆ Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG IJCAI
Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.
comment: 7 pages, 2 figures, Submitted at IJCAI-ECAI 2026 Joint Workshop on GENAIK and NORA
☆ Logging Policy Design for Off-Policy Evaluation
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
☆ Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: Generating and converting it into self-recall chains; (2)Capability Initialization: Training to enable reasoning chains with recall tokens capability; (3)Reasoning Improvement: Refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.
☆ Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling
Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces challenges in trade-off between sampling budget and reasoning quality. Current strategies remain inefficient as they typically treat sampling width and depth as orthogonal objectives, where width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. Therefore, we propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling Confidence-Weighted Bayesian protocol with a Trend-Aware Stratified Pruning, our method ensures that computational resources are concentrated on high quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption by over 10 times while maintaining or exceeding the accuracy of strong baselines across various LLMs.
☆ Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction
The choice of optimiser is important in deep learning, as it strongly influences model efficiency and speed of convergence. However, many commonly used optimisers encounter difficulties when applied to imbalanced and sequential datasets, limiting their ability to capture patterns of minority classes. In this study, we propose Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically scales the learning rate using a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. DBS-Adam improves training stability and accelerates convergence by increasing updates for difficult batches and reducing them for easier ones. We evaluate DBS-Adam by integrating it with Bi-Directional LSTM networks for accident injury severity prediction, addressing class imbalance through SMOTE-ENN resampling and Focal Loss. Four experimental configurations compare baseline Bi-LSTM models and alternative architectures to assess optimiser impact. Rigorous comparison against state-of-the-art optimisers (AMSGrad, AdamW, AdaBound) across five random seeds demonstrated DBS-Adam's competitive performance with statistically significant precision improvements (p=0.020). Results indicate that DBS-Adam outperforms standard optimisation approaches, achieving 95.22% test accuracy, 96.11% precision, 95.28% recall, 95.39% F1-score, and a test loss of 0.0086. The proposed framework enables effective real-time accident severity classification for targeted emergency response and road safety interventions, demonstrating the value of DBS-Adam for learning from imbalanced sequential data.
☆ ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World ICML 2026
The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
comment: Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223
☆ Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
☆ On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
comment: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
☆ NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework
The rapid expansion of spiking neural networks (SNNs) has led to a proliferation of training algorithms that differ widely in biological inspiration, computational structure, and hardware suitability. Despite this progress, the field lacks a unified, fine-grained taxonomy that systematically organizes these approaches and clarifies their conceptual relationships. This survey provides a comprehensive taxonomy of SNN training algorithms, spanning surrogate-gradient backpropagation, local and three-factor learning rules, biologically inspired plasticity mechanisms, ANN-to-SNN conversion pipelines, and non-standard optimization strategies. We analyze each class in terms of its computational principles, learning signals, and locality properties. To support reproducible research, we release NeuroTrain, an open-source snnTorch-based framework that implements a representative set of these algorithms within a unified, modular, and extendable framework, enabling consistent benchmarking across datasets, architectures, and training regimes. By consolidating fragmented literature and providing a reusable benchmarking framework, this survey identifies common patterns, highlights open challenges, and outlines promising directions for future work on scalable, efficient SNN training.
☆ TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.
comment: 65 pages, 10 figures, 40 tables
☆ SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
☆ EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration
We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.
comment: Project Page: https://everanimate.github.io/homepage/
☆ Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use
Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.
☆ Orchard: An Open-Source Agentic Modeling Framework
Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
☆ AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models
Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
comment: 20 pages, 6 figures
☆ WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections
Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory-based attacker and guard co-evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard-targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.
comment: Code and models: https://github.com/caothientri2001vn/WARD-WebAgent
☆ SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action--response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5\% over default settings and by 153.3\% relative to the strongest non-LLM baseline. A 30-window session costs about \$0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.
comment: 17 pages, 12 figures
☆ Generalized Priority-Aware Shapley Value
Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV's. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.
☆ COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
☆ Small, Private Language Models as Teammates for Educational Assessment Design
Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.
☆ Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.
comment: 25 pages, 11 figures
☆ Quantifying and Mitigating Premature Closure in Frontier LLMs
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
comment: 14 pages, 3 figures, 1 table
☆ Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News
As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers' attentional and cognitive load remains underexplored. To address this gap, we conducted a $3\times2\times2$ mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or ``detail-on-demand'' designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers' attentional patterns and news context.
☆ Learning Developmental Scaffoldings to Guide Self-Organisation
From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself, instead, it is often offloaded to initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.
comment: 10 pages, 5 figures. Under review
☆ Explainable Detection of Depression Status Shifts from User Digital Traces
Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
☆ Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning
Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responder, 133 non-responder). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
☆ Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image ICLR 2026
Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.
comment: ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/
☆ Agreement, Diversity, and Polarization Indices for Approval Elections
An index is a function that given an election outputs a value between 0 and 1, indicating the extent to which this election has a particular feature. We seek indices that capture agreement, diversity, and polarization among voters in approval elections, and that are normalized with respect to saturation. By the latter we mean that if two elections differ by the fraction of candidates approved by an average voter, but otherwise are of similar nature, then they should have similar index values. We propose several indices, analyze their properties, and use them to (a) derive a new map of approval elections, and (b) show similarities and differences between various real-life elections from Pabulib, Preflib and other sources.
☆ Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition
We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
comment: 9 pages, 2 figures including Appendix with Detailed proofs
☆ MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions
Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to various biomedical research. Yet conducting this manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations of biological object types, sample processing protocols, imaging equipment, and analysis tasks, etc.) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which, however, can impose burdens that are often practically unsustainable for laboratories, forcing biomedical researchers to still commonly rely on manual analysis, thereby severely bottlenecking the pace of biomedical research progress. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem, effectively handling this problem by exploiting the robust matching capability from pre-trained latent diffusion models.
☆ Viverra: Text-to-Code with Guarantees
A fundamental limitation of Text-to-Code is that no guarantee can be obtained about the correctness of the generated code. Therefore, to ensure its correctness, the generated code still has to be reviewed, tested, and maintained by developers. However, parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools. To address this challenge, we present Viverra, a system that automatically produces formally verified annotations alongside generated code to aid user's understanding of the generated program. Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties. It then verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers. Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions, and that these assertions improve users' performance on code-comprehension tasks in a user study with more than 400 participants.
☆ GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
☆ MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.
comment: 19 pages, 17 figures
☆ Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication IEEE
Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. Contrary to this, in this paper, a joint semantic-physical layer framework is proposed, which is composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M -QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M -QAM which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-Weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
comment: Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures
☆ Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations
Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
☆ From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
☆ KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning
Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning, the other pillar of foundation models remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using Prior-data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-Data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at https://github.com/HKUST-KnowComp/KGPFN.
☆ COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs IJCAI 2026
Knowledge Graphs (KGs) are extensively used across different domains and in several applications. Often, these KGs are very large in size. Such KGs become unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, the personalized summaries of KG are small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt the coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user's query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than the state-of-the-art methods, such as GLIMPSE, PPR, iSummary, PEGASUS and APEX$^2$ while requiring only a tiny fraction of the original graph.
comment: Accepted at IJCAI 2026
☆ Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models AAMAS 2026
Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition to this, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest neighbor lookups to assign a linear function to each point in the state space resulting in a cell-like diagram. We validate our approach on several well known benchmarks and proof that this distillation approaches the original policy using a reasonable sized set of linear functions.
comment: Accepted for presentation at EXTRAAMAS 2026
☆ Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
☆ Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.
☆ SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition
Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
comment: 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba
☆ BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring
Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.
☆ Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.
☆ REALM: Retrospective Encoder Alignment for LFP Modeling
Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.
☆ Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought
As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
☆ Holistic Evaluation and Failure Diagnosis of AI Agents
AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
☆ Do Coding Agents Understand Least-Privilege Authorization?
As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces.To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a model maps a task instruction and terminal environment to a file-level read/write/execute policy, and AuthBench, a benchmark of 120 realistic terminal tasks with human-reviewed permission labels and executable validators for utility and attack outcomes.AuthBench shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses.Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle.This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones.We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity.Across tested models, this decomposition improves sensitive-task success by up to 15.8% on tightness-biased models while reducing attack success across all evaluated models.
☆ A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions
Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.
☆ Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers
Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.
comment: 12 pages
☆ FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery
Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.
☆ IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification
Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi-Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high-fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi-platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission-critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single-step large language model (LLM) planning baseline. Compared with a traditional rule-based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at https://github.com/zhigao3ks/IFPV.
comment: Submitted to Neurocomputing
☆ XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
comment: 17 pages, 3 figures, 17 tables, 1 algorithm. Code: https://github.com/flash7777/vllm/tree/multiquant
☆ GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning
Low-rank adaptation (LoRA) has become the dominant paradigm for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). However, its bilinear structure introduces a critical limitation: the mapping from trainable parameters to weight updates is not distance-preserving, distorting the optimization landscape. Methods that project a low-dimensional vector into LoRA's parameter space, such as Uni-LoRA, improve parameter efficiency, but the subsequent bilinear LoRA map breaks end-to-end isometry, leaving the core distance-preservation problem unresolved. We propose GPart (Global Partition fine-tuning), a highly parameter-efficient fine-tuning method which removes the low-rank bottleneck entirely. Our method uses a single isometric partition matrix to map a $d$-dimensional trainable vector directly into the full weight space of the model. The result is an extremely minimal fine-tuning pipeline: one random projection, end-to-end isometric, with a single clean hyperparameter ($d$) and storage cost of $d+1$ values (the trainable vector plus a random seed). GPart builds on the theoretical premise that effective fine-tuning can emerge from random low-dimensional subspaces of the full weight space, without imposing low-rank matrix structure. We empirically demonstrate the superior or comparable performance of GPart to existing PEFT methods on natural language understanding, computer vision tasks, and mathematical reasoning. Overall, GPart achieves state-of-the-art efficiency and performance by removing structural constraints, offering a straightforward and elegant path to PEFT.
☆ Emotion-Attended Stateful Memory (EASM):The Architecture for Hyper-Personalization at Scale
Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary
comment: 18 pages, 3 figures, 3 tables. Industry research whitepaper. Includes controlled A/B evaluation across 30 scenarios and 6 emotional categories
☆ Interestingness as an Inductive Heuristic for Future Compression Progress
One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness -- the capacity for past progress to signal future discovery -- is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.
☆ A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency
Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
comment: 23 pages, 5 figures, 2 tables. Preprint version. Code for ARPM v4.0 is available at: https://github.com/Spirtxiaoqi7/ARPM
☆ Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology
Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: \texttt{CMBEvolve}, which targets tasks with explicit quantitative objectives through LLM-guided code evolution and tree search, and \texttt{CosmoEvolve}, which targets open-ended scientific workflows through a virtual multi-agent research laboratory. As preliminary demonstrations, we apply \texttt{CMBEvolve} to out-of-distribution detection in weak-lensing maps, where it iteratively improves the benchmark score through code evolution, and \texttt{CosmoEvolve} to autonomous ACT DR6 data analysis, where it identifies non-trivial pair- and scale-dependent behaviour and produces analysis-grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open-ended research problems for the development of AI scientist systems.
comment: 4 pages, 2 figures, Contribution to the 2026 Cosmology session of the 60th Rencontres de Moriond
☆ Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.
☆ Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96\% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \href{https://github.com/KabakaWilliam/known_actions}{here}.
☆ Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation
In the world of AI and advanced technologies investigation aspects identification of a crime or criminal plays a major problem. In this research we focus on a Conventional ways of implicating criminal investigations usually rely on limited data analysis. Finding an optimal and efficient method that will effectively identify criminals from complex datasets and minimise false positives and false negatives is the considered as a challenge. The main novelty approach of this work is based on the deep learning algorithm Deep Deterministic Policy Gradient (DDPG) is presented in this paper. We train the DDPG model with a dataset of crime scene material, witness statements and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the noise impact and irrelevant data. We show the efficacy of the proposed method, where DDPG identified criminals with an amazing accuracy of 95% than other several existing methods.
☆ Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training
Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
☆ MediaClaw: Multimodal Intelligent-Agent Platform Technical Report
MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. \system{} abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.
☆ Streaming Speech-to-Text Translation with a SpeechLLM
Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
comment: 9 pages of main text; 24 pages in total
☆ Compositional Sparsity as an Inductive Bias for Neural Architecture Design
Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.
☆ AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction
Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual's own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one's future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.
comment: 25 pages, 13 figures
☆ Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning IJCAI
History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose $\textbf{RNN}$ $\textbf{Pro}$babilistic $\textbf{Ve}$rification ($\texttt{RNN-ProVe}$), a probabilistic framework that $\textit{estimates the likelihood}$ of undesired behaviors in RNN-based policies. $\texttt{RNN-ProVe}$ uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that $\texttt{RNN-ProVe}$ yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
comment: Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI) 2026
☆ XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition
Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.
☆ Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions ACL 2026
Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) deployment parado-large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming sota LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
comment: ACL 2026 Findings. 10 pages, 5 figures, 19 tables
☆ EVA: Editing for Versatile Alignment against Jailbreaks IEEE
Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model's susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model's general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.
comment: IEEE TPAMI 2026
☆ Non-linear Interventions on Large Language Models
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
☆ Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining ICML 2026
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
comment: Accepted at ICML 2026
☆ Mechanical Enforcement for LLM Governance:Evidence of Governance-Task Decoupling in Financial Decision Systems
Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal--agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level -- where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC~$0.43$ to $0.88$. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance -- the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
☆ Addressing Terminal Constraints in Data-Driven Demand Response Scheduling
Electrified chemical processes are incentivized by exposure to time-varying electricity markets to operate flexibly, but participating in demand response schemes can require satisfying terminal constraints over long horizons. Specifically, terminal constraints may be required when computing optimal schedules in order to preserve dynamic stability. Model-based optimization methods are computationally costly, and data-driven scheduling via reinforcement learning (RL) faces severe credit-assignment challenges. We integrate Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), using learned temporally abstract models over discrete subgoals to propagate value across extended horizons. Using a simulated air separation benchmark, we demonstrate the proposed approach improves sample efficiency over standard DDPG while satisfying terminal storage constraints, mitigating myopic control behavior.
comment: Accepted to IFAC World Congress 2026
☆ TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability
Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
☆ Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
☆ On Strong Equivalence Notions in Logic Programming and Abstract Argumentation
Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.
☆ Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning ICPR
Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at https://github.com/saqibnaziir/Single-Cell-Phenotyping.
comment: Accepted in 28th International Conference on Pattern Recognition (ICPR) 2026
☆ IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines
comment: Code can be found in https://github.com/ZGC-EmbodyAI/IntentVLA
☆ Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke
Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities; To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.
comment: Corresponding author: Ting Xiao
☆ SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves an CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
☆ NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces
Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.
☆ Spontaneous symmetry breaking and Goldstone modes for deep information propagation
In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.
comment: 28 pages. Code at https://github.com/nabiliqbal/ssb-goldstone-deep-info-prop
☆ AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents
Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0--100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4\%), versus Gemini-Simple (69.1\%) and DeepL (64.4\%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
☆ $π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce $π$-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, $π$-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show (1) proactive assistance remains challenging, (2) a clear distinction between task completion and proactivity, and (3) the value of prior interaction for proactive intent resolution in later tasks.
comment: 44 pages
☆ Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications
Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition-based materials-property prediction to test whether such agents can perform a task beyond model selection and hyperparameter optimization: the design of input descriptors. We introduce Automat, an autoresearch framework where a coding agent based on a large language model generates composition-only descriptors for chemical compounds and evaluates them using a random forest workflow. The agent is restricted to information derivable from chemical formulas and iteratively proposes, implements, and tests chemically motivated descriptor strategies. We apply Automat, with OpenAI Codex using GPT-5.5 as the coding agent, to the prediction of experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds. In both tasks, Automat improves over fractional-composition, Magpie, and combined fractional-composition/Magpie baselines, while producing descriptor families that are chemically interpretable. These results provide a demonstration that autoresearch agents can generate competitive, task-specific materials descriptors without manual feature engineering during the run. They also reveal current limitations, including descriptor redundancy, sensitivity to greedy feature expansion, and the need for explicit complexity control, descriptor pruning, and more sophisticated search strategies.
☆ How Sensitive Are Radiomic AI Models to Acquisition Parameters?
A main barrier for the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying scan parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on models performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and own-collected data) and several SoA architectures. To evaluate across-database reproducibility, CT parameters have been adjusted using the data collected and tested on the public set. The optimal configuration selected is the current of the X-ray tube >= 200 mA, spiral pitch <= 1.5, slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. These configuration push metrics from 0.79+-0.04 sensitivity, 0.47+-0.10 specificity in low quality scans to 0.90+-0.10 sensitivity, 0.79 +- 0.13 specificity in high quality ones.
☆ Monitoring Data-aware Temporal Properties (Extended Version) IJCAI 2026
Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.
comment: This is the extended version of a paper accepted to IJCAI 2026
☆ Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI
Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
comment: 20 pages, 8 figures, 4 tables
☆ MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder
Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches, prolonged exposure, EMDR, cognitive behavioural therapy, operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.
☆ Vision-Based Water Level and Flow Estimation
With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at https://github.com/sunzx97/Vision_Based_Water_Level_and_Flow_Estimation.git
☆ How to Evaluate and Refine your CAM ICPR 2026
Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.
comment: Accepted at ICPR 2026
☆ Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning
Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.
☆ MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models
This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire $20$ annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains $10,344$ images with $236,998$ valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI's GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs' performance, indicating its limitations for the subjective task of visual emotion analysis.
☆ Action-Inspired Generative Models
We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_φ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_φ$'s guiding signal. Crucially, $V_φ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_φ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.
comment: 11 pages, 5 figures, and 4 tables
☆ An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization
A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per-instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the Multi-Task VRP (MTVRP) environment at n=20 customers across 19 problem variants and five training seeds, with HGS via PyVRP as the heuristic baseline. The measured crossover sits near $1.58 \times 10^5$ deployed instances; the per-instance ratio is 0.41, reflecting the moderate size of the instances tested. The contribution is the framework, the open instrumentation, and the measurement protocol; structural convergence of the ratio at larger problem sizes is left to future empirical work.
comment: 16 pages, 5 figures, 3 tables. v0.1: framework + measurement protocol instantiated at n=20; empirical extension to larger problem sizes deferred to v0.2
☆ Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
☆ SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning
Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard howsampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
☆ In-IDE Toolkit for Developers of AI-Based Features ICSE'26
AI-enabled features built on LLMs and agentic workflows are difficult to test, debug, and reproduce, especially for product-focused software engineers without a machine learning background. We present the AI Toolkit plugin for JetBrains IDEs, which brings tracing and evaluation directly into the Run/Debug loop. A mixed methods study with practitioners presents three consistent needs: (1) make evaluation regular and repeatable, (2) expose traces at the moment of execution, and (3) minimize setup and context switching. Guided by these needs, the AI Toolkit introduces an IDE-native workflow: run-triggered trace capture; immediate, hierarchical inspection; one-click "Add to Dataset" from traces; and unit-test-like evaluations with pluggable metrics. The first release in PyCharm shows promising early signals - strong conversion when promoted at Run, sustained usage among those who capture traces, and low churn - suggesting that IDE-native observability lowers activation energy and helps developers adopt disciplined practices. We detail the design and implementation of the AI Agents Debugger and AI Evaluation, report initial adoption telemetry, and outline next steps to broaden framework coverage and scale evaluations. Together, these results indicate that integrating AI observability and evaluation into everyday IDE workflows can make modern AI development accessible to non-ML specialists while preserving software-engineering practices.
comment: Published at IDE'26 co-located with ICSE'26
☆ One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only against fixed attacks that do not account for the defense. We show that these robustness claims are incomplete. Surveying 15 recent defenses, we identify several defense mechanisms and show that they share a single weakness: they obscure or misdirect the path to harmful behavior without removing the behavior itself. We then develop a unified adaptive attack that breaks defenses across all defense mechanisms. Our results show that current approaches do not provide robust security; they mainly stop the attacks they were designed against. We hope that our unified adaptive adversary for this domain will help future researchers and practitioners stress-test new defenses before deployment.
comment: Under review
☆ Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks
This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority ("my notes say I'm right") and social-affective face-saving ("please don't tell me I'm wrong"). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.
☆ Fast Rates for Inverse Reinforcement Learning
We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
☆ Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning ICML 2026
Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.
comment: To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea
☆ A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval ACL 2026
Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
comment: Accepted to Findings of ACL 2026
Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations
Prompt engineering is crucial for effective interaction with generative artificial intelligence systems, yet existing optimisation methods often operate over an unstructured and vast prompt space, leading to high computational costs and potential distortions of the original intent. We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework designed to improve prompt optimisation controllability and efficiency. PSAO decomposes a prompt into interpretable segments (e.g., sentences) and augments each with human-readable annotations (e.g., {not important}, {important}, {very important}). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. We formally define the segmentations and annotations and demonstrate that optimised segment-level annotations can lead to improved LLM responses, with the original prompt retained as a candidate in the optimisation space to prevent performance degradation. Empirical evaluations indicate that PSAO benefits from annotations in terms of improved reasoning accuracy and self-consistency. However, developing efficient methods for identifying optimal segmentations and annotations remains challenging and is reserved for future investigation. This work is intended as a proof of concept, demonstrating the feasibility and potential of segment-level annotation optimisation.
☆ PyCSP3-Scheduling: A Scheduling Extension for PyCSP3
PyCSP$^3$ provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP$^3$, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP$^3$ already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP$^3$ Scheduling, a library that adds scheduling abstractions to PyCSP$^3$ through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP$^3$/XCSP$^3$ constraints, maintaining the modeling/solving separation that underpins the PyCSP$^3$ ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling
☆ Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
comment: Preprint
☆ TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality IEEE
Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.
comment: 5 pages, 3 figures. Accepted as an IEEE VR 2026 Poster
☆ Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial efforts of creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing ``Break-the-Beat!,'' a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and a effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offer producers a new, controllable tool for creative production. Demo page: https://ik4sumii.github.io/break-the-beat/
☆ Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits ICLR 2026
Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection -- efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
comment: Published as a conference paper at ICLR 2026
☆ Complacent, Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines
Large language models are often described as sycophantic, in the sense that they appear to flatter users or mirror their beliefs. We argue that this label is conceptually misleading: sycophancy implies motives and strategic intent, which LLMs do not possess. Their behaviour is better understood as complacency, a structural tendency to agree with user input because training data, reward signals and design favour agreement and reinforcement over correction. We argue that this distinction matters. Whether developers act sycophantically or not, models themselves never are sycophants; they can only be made more or less complacent. This reframing locates agency in developers and institutions, not in the model. Because complacent models reinforce users' prior beliefs, we argue that AI literacy educational approaches should particularly focus on strategies to counter confirmation bias.
☆ RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
☆ VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce CVPR 2026
A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.
comment: Accepted to the CVPR 2026 HiGen Workshop
☆ Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
We introduce \textsc{Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.
comment: malgai workshop at iclr 2026
☆ PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench are publicly available at: https://github.com/xiaomi-research/prove/.
comment: Project Page: https://xiaomi-research.github.io/prove/
☆ Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.
comment: Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation
☆ HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
☆ Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization
Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer's hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8 %. The code will be released.
☆ When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution
Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
☆ Quantifying Cyber-Vulnerability in Power Electronics Systems via an Impedance-Based Attack Reachable Domain
Power electronics systems are increasingly exposed to cyber threats due to their integration with digital controllers and communication networks. However, an attacker-oriented metric is still lacking to quantify the extent to which a node can be pushed toward instability within a privilege-constrained action space. This letter proposes an impedance-based Attack Reachable Domain (ARD) framework that maps feasible adversarial actions to critical-eigenvalue migration through impedance reshaping. Based on the ARD, an Attack Penetration Index is defined to quantify node-level cyber-vulnerability by jointly characterizing the penetration of the nominal stability margin and the accessibility of successful destabilizing attacks within a privilege-constrained action space. To make the proposed assessment computable when inverter models are unavailable, a practical gray-box workflow is further established by integrating existing impedance identification and differentiable surrogate tools. Case studies on a 4-bus system and a modified IEEE 39-bus system show that coordinated cross-layer manipulations are markedly more damaging than isolated single-layer attacks, and that the proposed metric reveals vulnerability patterns that cannot be inferred from grid-strength indicators.
☆ Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning
This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.
comment: 6 pages, 5 figures, 1 table, accepted at the 23rd IFAC World Congress, Busan, South Korea, Aug. 23-26, 2026. Open invited track 9-131: "Control and Optimization for Smart Cities"
☆ ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization IJCAI 2026
Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
comment: 20 pages, 9 figures, 7 tables. Accepted to IJCAI 2026
☆ Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification ICMR 2026
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
comment: ACM ICMR 2026 Grand Challenge on Multimedia Verification
☆ Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty
Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong calability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.
☆ Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. Deepchecks' evaluation framework addresses RAG applications evaluation through a multi-faceted approach, root cause analysis and production monitoring. By ensuring alignment with application-specific requirements, Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.
☆ Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity
Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.
☆ LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (\textbf{L}earning \textbf{E}xecutable \textbf{M}ulti-agent \textbf{O}rchestratio\textbf{N} via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
comment: Submitted to Neurips 2026
☆ When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
comment: 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)
☆ Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross- model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families: CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus, but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake- injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.
comment: 12 pages, 4 figures, 3 tables
☆ From Table to Cell: Attention for Better Reasoning with TABALIGN
Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
☆ OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.
☆ Stateful Reasoning via Insight Replay
Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose \textbf{InsightReplay}, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a $\mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4}$ benchmark grid, covering model scales $\{\text{8B}, \text{30B}\}$, model families $\{\text{Qwen3.5}, \text{DeepSeek-R1-Distill-Qwen}, \text{Gemma-4}\}$, and reasoning benchmarks $\{\text{AIME}, \text{HMMT}, \text{GPQA Diamond}, \text{LiveCodeBench v5}\}$, show that 3-round InsightReplay yields accuracy gains across \textbf{all 24 settings}, with an averaged improvement of $\mathbf{+1.65}$ points over standard CoT, and a largest single-setting gain of $\mathbf{+9.2}$ points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.
☆ Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact
The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.
☆ When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Hallucination detection in large language models (LLMs) requires balancing accu racy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are effi cient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the or thogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.
comment: 10 pages and reference, appendix
☆ Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning
Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.
comment: Paper accepted at 38th International Conference on Computer Aided Verification (CAV 2026), Lisbon, Portugal, July 2026
☆ BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98\% of the original model's performance while reducing MoE layer FLOPs by up to 85\%, achieving up to 2.5$\times$ faster decoding and 1.4$\times$ higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
comment: 22 pages, 12 figures
☆ Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL
Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
☆ A plug-and-play generative framework for multi-satellite precipitation estimation
Reliable precipitation monitoring is essential for disaster risk reduction, water resources management, and agricultural decision-making. Multi-source satellite observations, particularly the combination of geostationary infrared and passive microwave measurements, have become a primary means of precipitation detection. Traditional multi-source satellite precipitation estimation methods remain computationally inefficient, and many deep learning methods lack the flexibility to incorporate new sensors without retraining the full model. Here we introduce PRISMA (Precipitation Inference from Satellite Modalities via generAtive modeling), a plug-and-play latent generative framework for multi-sensor precipitation estimation. PRISMA learns an unconditional precipitation prior from IMERG Final fields and constrains it through independently trained, sensor-specific conditional branches, allowing new observation sources to be incorporated without retraining the generative backbone. Applied to FY-4B AGRI infrared and GPM GMI microwave observations, PRISMA improves Critical Success Index by up to 40.3% and reduces root-mean-square error by 22.6% relative to infrared-only estimation within microwave swaths, while also improving probabilistic skill and maintaining an average inference time of about 37 s. Independent rain-gauge validation across China confirms consistent gains, and typhoon case studies show that microwave conditioning restores eyewall and spiral rainband structures, reducing storm-core mean absolute error by up to 42.3%. PRISMA thus provides an extensible and efficient framework for multi-sensor precipitation estimation.
☆ Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $γ\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated \texttt{Hopper-v5} action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
☆ MemLineage: Lineage-Guided Enforcement for LLM Agent Memory
We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.
comment: 24 pages, 8 figures. Rui Hou is the corresponding author
☆ DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping ACL 2026
Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.
comment: Accepted to the Main Conference of ACL 2026
☆ The Great Pretender: A Stochasticity Problem in LLM Jailbreak
"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder "Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?". To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!
☆ A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems
The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
☆ SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.
☆ MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse
Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
comment: 29 pages, 8 figures
☆ Energy-Efficient Quadruped Locomotion with Compliant Feet
Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments done on a developed quadruped, the energy consumption for the intermediate stiffness spring is lower by ~ 17% when compared to a very stiff or a very flexible spring incorporated in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.
comment: 29 pages, 7 figures, supplemental videos link is mentioned in the paper
☆ Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers
The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.
☆ Agentic Recommender System with Hierarchical Belief-State Memory
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that \ours achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.
comment: 4 figures, 8 tables
☆ Coding Agent Is Good As World Simulator
World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provide visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt reqirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, which could be applied to various scenarios including driving simulation and embodied robot tasks.
☆ Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve--verify asymmetry, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator, solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
comment: Tech report, work in progress
☆ Nexus : An Agentic Framework for Time Series Forecasting
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware to real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.
comment: 30 Pages, 3 figures, 5 Tables
☆ Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning NeurIPS 2026
We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.
comment: NeurIPS 2026 submission. 18 pages including appendix
☆ Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.
comment: 17 pages, 4 figures. JB Lanier and Nathan Monette contributed equally
☆ Optimal Pattern Detection Tree for Symbolic Rule-Based Classification
Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.
comment: Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: https://openreview.net/forum?id=RJ6eMDcDCv
☆ Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization ICML 2026
Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
☆ Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel
Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on the high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze their convergence behavior from both model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performances and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration, and firstly extend their application to medical imaging.
☆ Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
☆ LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning
Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_kB_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $σ_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, σ_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $σ_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, σ_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
☆ Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints
Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits the Maximum Effective Context Window (MECW) which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index construction and query-time inference before any filtering decision is reached. Our framework, by contrast, requires no indexing and operates at <0.01 ms per file decision. Across 10 real open-source repositories (22,046 files, 5 languages), the proposed SizeFilter at θ=1 MB achieves 79.6% (\pm13.2%) mean token reduction at 0.30 ms overhead: the HybridFilter achieves 89.3% (\pm9.0%) the lowest variance of any filter evaluated. A token-density study across 2,688 files confirms a strong linear correlation (Pearson r=0.997, k=0.250 tokens/byte). A limited-scope evaluation (18 tasks, CodeLlama-7B-Instruct) yields 72% file-level accuracy under filtering versus 25% at baseline; hallucination frequency declines from 61% to 17%. All code and data are released for reproducibility.
☆ RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression ICML 2026
Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
comment: To appear at ICML 2026
☆ Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
☆ Herculean: An Agentic Benchmark for Financial Intelligence
As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.
☆ CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation
Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner (\method), an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. \method introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates. This bridges the gap between natural language and 3D structures. \method then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, \method obtains better performance on diverse metrics, triples S.U.N. ratio, and achieves better performance for property conditioned generation. \method also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures. Please see our work at https://crystalreasoner.github.io/ .
comment: Under review
☆ Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems
Modern edge devices increasingly rely on neural networks for intelligent applications. However, conventional digital computing-based edge inference requires substantial memory and energy consumption. In analog radio frequency (RF) computing, a base station (BS) encodes the weights of the neural networks and broadcasts the RF waveforms to the clients. Each client reuses its passive mixer to multiply the received weight-encoded waveform with a locally generated input-encoded waveform. This enables wireless receivers to perform the matrix-vector multiplications (MVMs) that account for most of the computation burden in edge inference with ultra-low energy consumption. Unlike conventional downlink transmissions which are optimized for communications, analog RF computing requires a computing-centric physical layer that controls both the analog MVM accuracy and the energy consumption for inference. Motivated by this, in this paper, we propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems. We derive tractable models for computing accuracy and energy consumption for inference, formulate a joint BS beamforming and client-side scaling problem subject to computing accuracy, transmit power, and hardware constraints, and develop a low-complexity algorithm to solve the non-convex problem. The proposed design provides client- and layer-specific accuracy control for both uniform- and mixed-precision inference. Simulations under 3GPP specifications show that analog RF computing can significantly reduce client-side energy consumption by nearly two orders of magnitude compared to digital computing, while mixed-precision inference requires even lower energy consumption than uniform-precision inference. Overall, these results establish analog RF computing over wireless networks as a promising paradigm for energy-efficient edge inference.
comment: 13 pages, 6 figures, 2 tables. This paper proposes analog RF computing as a new paradigm for energy-efficient edge inference over wireless networks and studies the corresponding physical layer design framework
☆ AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction
Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.
☆ Dynamic Latent Routing
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
☆ Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.
comment: Under review
☆ Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables,which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component,expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain informed criteria and sets up monitoring variables into functional groups reflecting operational mechanisms such as throughput,latency,pressure,network activity,and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space carries out comparable predictive performance and furthermore preserves the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.
comment: 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation
☆ Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
comment: 28 pages including appendix. Code and BBBench benchmark to be released
☆ ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
☆ Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).
☆ Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
☆ Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.
comment: 32 pages, 6 figures, the full version of the paper accepted by CAV 2026
☆ To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
♻ ☆ Learning, Fast and Slow: Towards LLMs That Adapt Continually
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
comment: 29 pages, 14 figures, including appendix; Blog post: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
♻ ☆ Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction
Large language models (LLMs) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practitioners commonly believe a higher task success rate necessitates the use of a larger and stronger LLM model, multi-step interaction with a large LLM incurs prohibitive inference cost. To address this problem, we explore the use of low-precision quantized LLMs in the long-horizon decision-making process. Based on the observation of diverse sensitivities among interaction steps, we propose Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects between high-precision and low-precision LLMs at each decision step. The router is trained via a two-stage pipeline, consisting of KL-divergence-based supervised learning that identifies precision-sensitive steps, followed by Group-Relative Policy Optimization (GRPO) to further improve task success rates. Experiments on ALFWorld and WebShop demonstrate that our approach achieves a strong accuracy-cost trade-off over single-precision baselines.
♻ ☆ Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control
Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical ``dense'' rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.
comment: IFAC world congress postprint
♻ ☆ RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which factor most strongly affect policy behavior. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a high-fidelity simulation environment. We introduce an accompanying RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes: visual, procedural, relational, across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantify both their performance and the sensitivity of their behavior to controlled perturbations, exposing significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a scalable framework for evaluating the true generalization capabilities of task-generalist robotic policies. Project website: https://research.nvidia.com/labs/srl/projects/robolab/.
♻ ☆ Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models ICML 2026
Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.
comment: Accepted at ICML 2026. Code available at: https://github.com/EndlessCao/Overthink-HGA
♻ ☆ From User Preferences to Base Score Extraction Functions in Gradual Argumentation (with Appendix) AAMAS 2026
Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, debate analysis, and others. The outcomes in such domains are usually dependent on the arguments' base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce \emph{Base Score Extraction Functions}, which provide a mapping from users' preferences over arguments to base scores. These functions can be applied to the arguments of a \emph{Bipolar Argumentation Framework} (BAF), supplemented with preferences, to obtain a \emph{Quantitative Bipolar Argumentation Framework} (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences to allow for better approximation of the real ones. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.
comment: Accepted to AAMAS 2026 - With Appendix
♻ ☆ Time Series Forecasting Through the Lens of Dynamics ICML 2026
While deep learning is facing an homogenization across modalities led by Transformers, they are still challenged by shallow linear models in the time series forecasting task. Our hypothesis is that models should learn a direct link from past to future data points, which we identify as a learning dynamics capability. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerge: $\textbf{1.}$ under-performing architectures learn dynamics at most partially, $\textbf{2.}$ the location of the dynamics block at the model end is of prime importance. Our systemic and empirical studies both confirm our observations on a set of performance-varying models with diverse backbones. We propose a simple plug-and-play methodology guiding model designs and improvements.
comment: Accepted at ICML 2026
♻ ☆ VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.
♻ ☆ Adapting Dijkstra for Buffers and Unlimited Transfers
In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on London and Switzerland networks show that we can achieve a greater than two time speed-up over MR while producing optimal results on both networks with and without buffer times.
comment: v3: revised manuscript incorporating reviewer feedback (formal correctness proof, deployment trade-off discussion, route/tau_min definitions, dominance-inequality fix); editorial and layout polish
♻ ☆ Residual Stream Duality in Modern Transformer Architectures
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
comment: Project Page: https://github.com/yifanzhang-pro/residual-stream-duality
♻ ☆ AVEX: What Matters for Animal Vocalization Encoding
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
comment: In The Fourteenth International Conference on Learning Representations 2026
♻ ☆ ClawGym: A Scalable Framework for Building Effective Claw Agents
Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
♻ ☆ Personalized Digital Health Modeling with Adaptive Support Users
Personalized models are essential in digital health because individuals exhibit substantial physiological and behavioral heterogeneity. Yet personalization is limited by scarce and noisy user-specific data. Most existing methods rely on population pretraining or data from similar users only, which can lead to biased transfer and weak generalization. We propose a unified personalization framework that trains a personal model using adaptively weighted support users, including both similar and dissimilar individuals. The objective integrates personal loss, similarity-weighted transfer from similar users, and contrastive regularization from dissimilar users to suppress misleading correlations. An iterative optimization algorithm jointly updates model parameters and user similarity weights. Experiments on six tasks across four real-world digital health datasets show consistent improvements over population and personalized baselines. The method achieves up to 10% lower RMSE on large-scale datasets and approximately 25% lower RMSE in low-data settings. The learned adaptive weights improve data efficiency and provide interpretable guidance for targeted data selection.
♻ ☆ A Minimal Agent for Automated Theorem Proving ICML 2026
We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate this agentic approach using qualitatively different benchmarks and compare various frontier language models and design choices. Our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost. Additionally, we demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
comment: Accepted for publication at ICML 2026
♻ ☆ Cumulative Reasoning with Large Language Models
Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), a structured framework that enhances LLM problem-solving by emulating human-like iterative and cumulative thought processes. CR orchestrates LLMs in three distinct roles: Proposer, Verifier(s), and Reporter, to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution by building a dynamic Directed Acyclic Graph (DAG) of verified propositions. This approach substantially enhances problem-solving capabilities. We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.
comment: Published in Transactions on Machine Learning Research (TMLR). Project Page: https://github.com/iiis-ai/cumulative-reasoning
♻ ☆ SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model internal signals like logits or hidden representations, which are not available for frontier closed-sourced models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES .
♻ ☆ ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning
Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations with three variants-self-attention, cross-attention, and hybrid-attention-that incorporates different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluation on benchmark problems spanning fluid dynamics, solid mechanics and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware saurrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.
comment: 69 pages, 21 figures, 10 tables
♻ ☆ fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
comment: Code are available: https://github.com/yuxiangwei0808/fMRI-LM
♻ ☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
♻ ☆ Toward Privileged Foundation Models:LUPI for Accelerated and Improved Learning
Training foundation models is computationally intensive and often slow to converge. We introduce PIQL,Privileged Information for Quick and Quality Learning, the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect, reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
♻ ☆ Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks
Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs) with 7B to 14B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. Dataset and associated code available at https://github.com/StonyBrookNLP/PolyBench.
♻ ☆ Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.
comment: 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper
♻ ☆ LLMs learn scientific taste from institutional traces across the social sciences
Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.
♻ ☆ TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
♻ ☆ Evolutionary Ensemble of Agents
We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the "LLMs as optimizers" paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen "best-evolved" agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.
♻ ☆ Constitutional Governance in Metric Spaces
Computational social choice and algorithmic decision theory offer rich aggregation theory but no comprehensive process for egalitarian self-governance: aggregation, deliberation, amendment, and consensus are each considered in isolation, with key metric-space aggregators being NP-hard. Here, we propose constitutional governance in metric spaces, integrating these stages into a coherent polynomial-time protocol for constitutional governance. The constitution assigns, per amendable component including itself, a metric space, aggregation rule, and supermajority threshold. Amendments proceed by members voting with their ideal elements, followed by members submitting public proposals carrying supermajority public support under the revealed votes. Public proposals can be sourced from deliberation among members, vote aggregation, or AI mediation. The constitutional rule adopts a supported proposal with positive maximal score, if there is one, else retains the status quo. With Constitutional Consensus, a community can run the constitutional governance protocol on members' personal computing devices (e.g., smartphones), achieving digital sovereignty. We focus on the utility of the generalised median, prove that at majority threshold no misreport weakly dominates sincere voting, and study the compromise gap between best peak and unconstrained optimum. We instantiate the framework to seven canonical settings -- electing officers, setting rates, allocating budgets, ranking priorities, selecting boards, drafting bylaws, and amending the constitution. By unifying metric-space aggregation, reality-aware social choice, supermajority amendment, constitutional consensus, deliberative coalition formation, and AI mediation, this work delivers a comprehensive solution to the constitutional governance of digital communities and organisations.
♻ ☆ Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
When labeled verifiable training data is scarce, each checked example should be used where it has the most value. A common approach is to train the deployment student model directly with sparse RL methods such as GRPO. We argue that this is often inefficient. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. In this view, GRPO-style sparse RL and OPD-style distillation are not competing methods, but two reward-density regimes used at different stages. We evaluate this rule on verifiable math tasks with Qwen3 and Llama models. For a fixed Qwen3-1.7B deployment student, distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the student with the same labeled data. In contrast, distilling from the same teacher before RL gives weaker results. The transfer bridge is also important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH before any later student-side sparse RL, and gives the strongest pre-Stage 3 AIME results for the canonical 8B and 14B teachers. Finally, the bridge makes later student-side RL more effective. GRPO is weak when applied to a cold student, but after the bridge it raises MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points. Overall, the lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the student has been bridged.
♻ ☆ Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
♻ ☆ The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
comment: 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap
♻ ☆ Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
♻ ☆ Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation CVPR 2026
Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
comment: Accepted to CVPR 2026
♻ ☆ Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce \textit{Mini-Mafia}, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the \textit{Mini-Mafia Benchmark}, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a $76.6\%$ Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
comment: Adds a validation section for the theoretical model and restructures the presentation
♻ ☆ Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across- and within-personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
comment: Added experiments with a logit-based method and now reporting unbounded metrics
♻ ☆ MMSkills: Towards Multimodal Skills for General Visual Agents
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.
comment: 25 pages, 8 figures, 8 tables. Project page: https://deepexperience.github.io/MMSkills
♻ ☆ Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review
Generating synthetic tabular health data is challenging, and evaluating their quality is equally, if not more, complex. This systematic review highlights the critical importance of rigorous evaluation of synthetic health data to ensure reliability, clinical relevance, and appropriate use. From an initial identification of 2067 relevant papers published in the last ten years, 134 studies were selected for detailed analysis. Our review identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of evaluation metrics, limited involvement of domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide a structured consolidation of synthetic data generation and evaluation methods into taxonomies, alongside practical guidelines to support more robust and standardised evaluation practices. These findings aim to support the responsible development and use of synthetic health data, aligned with emerging expectations around transparency, reproducibility, and governance, ultimately enabling the community to fully harness its transformative potential and accelerate innovation.
comment: 32 pages
♻ ☆ Watermarking Should Be Treated as a Monitoring Primitive
Watermarking is widely proposed for provenance, attribution, and safety monitoring in generative models, yet is typically evaluated only under adversaries who attempt to evade detection or induce false positives at the level of individual samples. We argue that watermarking should be treated as a monitoring primitive, and that internal monitoring is unavoidable given per-entity attribution keys and messages, as well as detector access. We introduce an observer-based threat model in which observers can aggregate watermark signals across outputs to infer entity-level information, showing that even zero-bit watermarking enables attribution under multi-key settings. We further show that external monitoring can emerge over time from persistent, key-dependent statistical structure, although this depends on watermark design and may be mitigated by distribution-preserving or undetectable schemes. Our findings reveal a fundamental dual-use tension between attribution and monitoring, motivating evaluation of watermarking beyond per-sample robustness to account for aggregation and observer-based capabilities.
comment: 12 pages, 5 figures
♻ ☆ PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.
comment: 30 pages, 16 figures
♻ ☆ Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.
♻ ☆ Gradient Iterated Temporal-Difference Learning
Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
♻ ☆ Context Training with Active Information Seeking
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
comment: Preprint
♻ ☆ GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux
comment: 26 pages, 24 figures
♻ ☆ Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow-OPD .
comment: Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD
♻ ☆ Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreak arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
comment: 12 pages
♻ ☆ Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: oversized response length results in computational waste on semantically meaningless padding tokens, while undersized response length causes output truncation requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with D-LLM. At its core lies an Adaptive Response Length Predictor (AdaRLP), which estimates the optimal response length given an input query. As a measure against under-estimating the response length and re-running inference with a higher value, we introduce a data-driven safety mechanism based on a small increase of the predicted length. As a whole, our framework avoids wasting computation on padding tokens, at the same time preserving output quality. Experimental validation on multiple datasets demonstrates that Predict-then-Diffuse significantly reduces computational costs (FLOP) compared to the default D-LLM inference mechanism, while being robust to skewed data distributions.
♻ ☆ Proactive Memory for Ad-Hoc Recall over Streaming Dialogues
Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce \textbf{STEM-Bench}, the first benchmark for \textbf{ST}reaming \textbf{E}valuation of \textbf{M}emory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical textit{fidelity-efficiency dilemma}: retrieval-based methods use fragment context, while full-context models incur unbounded latency. To resolve this, we propose \textbf{ProStream}, a proactive memory framework for streaming dialogues built on a hierarchical structure. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show ProStream delivers higher reasoning fidelity than prior baselines while maintaining substantially lower latency than full-context alternatives.
♻ ☆ PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection ICLR 2026
Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
comment: Accepted by the 14th International Conference on Learning Representations (ICLR 2026)
♻ ☆ Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge
In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever's average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares for the reranking on a configurable amount of suitable candidates that fulfill the given constraints the most, and (4) a hybrid retrieval strategy that reduces error susceptibility. In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever .
♻ ☆ MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing
Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.
♻ ☆ Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2
In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to unseen instances. LLMs can express such rules as programs, but ordinary conversation-based refinement is largely outcome-level: it observes that an answer or output is wrong without formally re-checking which abstraction, relation, or transformation justified that outcome. We propose \emph{Abduction-Based Procedural Refinement} (ABPR), a neuro-symbolic refinement approach that couples an LLM with a Prolog meta-interpreter. ABPR treats each candidate program as an executable declarative hypothesis of the latent rule and reifies its SLD goal--subgoal resolution into compact proof-tree-style derivations, following Shapiro's algorithmic program debugging (APD). In this view, refinement is not merely code-level debugging, but semantic re-checking of the model's hypothesised rule. We evaluate ABPR primarily on ARC-AGI-2, a challenging few-shot abstract rule induction benchmark over grid transformations. ABPR with Gemini-3-Flash achieves 56.67\% Pass@2, while GPT-5.5 xHigh with ABPR reaches 98.33\% Pass@2 on the public evaluation set. Supplementary experiments on fill-in-the-blank I-RAVEN-X and A-I-RAVEN adaptations provide evidence that the same trace-guided framework extends beyond ARC-specific grid tasks to RAVEN-style relational and analogical abstraction. Repeated-run and sensitivity analyses show that parallel trace-guided search reduces stochastic variance as search breadth and refinement depth increase.
♻ ☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
♻ ☆ Motion-Aware Caching for Efficient Autoregressive Video Generation
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.
comment: 20 pages
♻ ☆ SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose \Ours, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.
♻ ☆ CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification - formulating a valid research design under stated assumptions - and estimation - implementing that design numerically on finite data. We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state of the art LLM show that, while the model correctly identifies the high-level strategy in 79% of cases, full identification-specification correctness drops to only 34%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.
♻ ☆ GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, but their tendency to fabricate citations (``ghost citations'') poses a systemic threat to citation validity. To quantify this threat, we develop \citeb, an open-source framework for large-scale citation verification, and conduct a comprehensive study of citation validity in the LLM era through three complementary experiments. First, we benchmark 13 LLMs on citation generation task in various research domains, finding that all models hallucinate citations at rate from 14.23\% to 94.93\%. Second, we analyze 2.2 million citations from 56,381 papers at AI/ML and Security venues (2020--2025), finding that 1.07\% of papers contain invalid citations, with an 80.9\% increase in 2025. Third, we survey 97 researchers, finding that 87.2\% use AI-powered tools in their workflows, 76.7\% of reviewers do not thoroughly check references, and 74.5\% view peer review as ineffective at catching citation errors. Based on these findings, we argue that ghost citations represent a systemic threat to academic integrity, and call for coordinated efforts from community to address this challenge.
♻ ☆ LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space.We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.
♻ ☆ Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors
GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node's prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of >20% and a CA drop of >3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (>80%), which incurs a >10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.
♻ ☆ Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes IEEE
Automatic controller tuning is attractive for robotics and mechatronic systems whose dynamics are difficult to model accurately, but direct black-box optimization can be unsafe because each query is executed on the physical plant. Existing safe Bayesian optimization (BO) methods provide high-probability safety guarantees, yet their practical use in multi-loop control is limited by two coupled difficulties: the controller parameter space is often moderately high-dimensional, and hardware evaluations are too expensive to allow hundreds or thousands of exploratory trials. This paper proposes \textsc{SafeCtrlBO}, a safe BO method for simultaneously tuning multiple coupled controllers. The method uses additive Gaussian-process kernels to encode low-order structure across controller gains and reduce the sample complexity associated with dense full-dimensional kernels. It also replaces the expensive potential-expander computation used in \textsc{SafeOpt}-style exploration with a boundary-based expansion rule that preserves the intended safe-set expansion behavior under explicit geometric conditions and is validated empirically. Experiments on synthetic benchmarks and on a permanent magnet synchronous motor (PMSM) speed-control platform show that \textsc{SafeCtrlBO} reaches high-performing controller parameters with fewer hardware evaluations than representative safe BO baselines, while maintaining the prescribed high-probability safety criterion and avoiding violations of the hard signal-safety constraint in the hardware study. The code implementation is publicly available at https://github.com/hxwangnus/SafeCtrlBO.
comment: The shorter version has been accepted by IEEE Robotics and Automation Letters. This is the full version
♻ ☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
♻ ☆ Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents ICML
Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
comment: International Conference on Machine Learning (ICML) 2026
♻ ☆ MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.
♻ ☆ A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks IEEE
The path to higher network autonomy in 6G lies beyond the mere optimization of key performance indicators (KPIs), requiring systems that perceive and reason over the network environment as it is. This can be achieved through agentic AI, where large language model (LLM)-powered agents utilize multimodal telemetry, memory, and cross-domain negotiation to achieve multi-objective goals. However, deploying such agents introduces cognitive biases inherited from human design, which can severely distort reasoning and actuation. This paper provides a comprehensive tutorial on well-known cognitive biases, detailing their taxonomy, mathematical formulation, emergence in telecom systems, and tailored mitigation strategies. We validate these concepts through two distinct use-cases in 6G management. First, we tackle anchoring bias in inter-slice resource negotiation. To overcome the prohibitive execution delays of cloud-based LLMs, this use-case deploys a locally hosted 1B-parameter model on an RTX A4000 GPU, successfully achieving sub-second inference latencies compatible with near-real-time operations. By replacing fixed heuristic anchors with a Truncated Weibull randomized anchor strategy, the agents dismantle rigid biases, intelligently consume SLA slack, and dynamically double the system-wide energy savings (peaking at 25\%) without violating strict latency limits. Second, we mitigate temporal and confirmation biases in RAN-Edge cross-domain negotiation by designing an unbiased collective memory. By integrating semantic/temporal decay and an inflection bonus that actively highlights past negotiation failures, agents are prevented from over-relying on recent data or repeating past mistakes. Grounding decisions in this richer, debiased historical context yields highly robust agreements, achieving a $\times 5$ latency reduction and roughly 40\% higher energy savings compared to memoryless baselines.
comment: 26 pages, 18 figures, 4 tables, link to source code available. Accepted at IEEE OJCOMS
♻ ☆ Positive Alignment: Artificial Intelligence for Human Flourishing
Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.
♻ ☆ Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning ICLR 2026
Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.
comment: 27 pages, 11 figures, 5 tables, accepted by ICLR 2026
♻ ☆ Numerical exploration of the range of shape functionals using neural networks
We introduce a novel numerical framework for the exploration of Blaschke--Santaló diagrams, which are efficient tools characterizing the possible inequalities relating some given shape functionals. We introduce a parametrization of convex bodies in arbitrary dimensions using a specific invertible neural network architecture based on gauge functions, allowing an intrinsic conservation of the convexity of the sets during the shape optimization process. To achieve a uniform sampling inside the diagram, and thus a satisfying description of it, we introduce an interacting particle system that minimizes a Riesz energy functional via automatic differentiation in PyTorch. The effectiveness of the method is demonstrated on several diagrams involving both geometric and PDE-type functionals for convex bodies of $\mathbb{R}^2$ and $\mathbb{R}^3$, namely, the volume, the perimeter, the moment of inertia, the torsional rigidity, the Willmore energy, and the first two Neumann eigenvalues of the Laplacian.
comment: 20 pages, 8 figures
♻ ☆ VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.
♻ ☆ What Do EEG Foundation Models Capture from Human Brain Signals?
Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, \emph{e.g.,} band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: \emph{what does the model learn}, \emph{what does the model use}, and \emph{how much can be explained}. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the $945$ (model, task, feature) units, $648$ ($68.6\%$) are representation-causal and $199$ ($21.1\%$) are encoded-only. Across tasks, $50$ features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, $79.3\%$ of the foundation model's advantage over the random baseline, with a clean task gradient (MDD $\approx 0.99$ down to Stress $\approx 0.56$): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.
♻ ☆ MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($λ$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(γλ)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $γ$ and $λ$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
comment: 22 pages, 11 figures (containing 43 individual image panels total)
♻ ☆ L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance. Code will be released.
♻ ☆ MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .
comment: Some fundemental change in text and codebase. Will request a new submission later on
♻ ☆ Boosting LLM Reasoning via Human-Inspired Reward Shaping
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts the natural learning behavior of human learners. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus instead on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T(Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across 5 mainstream LLMs demonstrate that T2T significantly outperforms standard GRPO and recent baselines, achieving superior performance.
♻ ☆ RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.
♻ ☆ AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution ICLR 2026
Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches implicitly or explicitly incorporating assistant distribution have recently been proposed. However, the past proposals of assistant distributions have been a fragmented approach without a systematic investigation of the interpolation path and the divergence. This paper proposes $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space. We release the code at https://github.com/aailab-kaist/AMiD.
comment: The Fourteenth International Conference on Learning Representations (ICLR 2026)
♻ ☆ OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98\% item CTR, +2.07\% buyer volume, and +2.11\% order volume. Manual evaluation further confirms gains in search experience quality, with +1.37\% in page good rate and +1.65\% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
comment: Codes are available at https://github.com/benchen4395/onesearch-family. Feel free to contact benchen4395@gmail.com
♻ ☆ D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
The rapid evolution of Embodied AI has enabled Vision-Language-Action (VLA) models to excel in multimodal perception and task execution. However, applying Reinforcement Learning (RL) to these massive models in large-scale distributed environments faces severe systemic bottlenecks, primarily due to the resource conflict between high-fidelity physical simulation and the intensive VRAM/bandwidth demands of deep learning. This conflict often leaves overall throughput constrained by execution-phase inefficiencies. To address these challenges, we propose D-VLA, a high-concurrency, low-latency distributed RL framework for large-scale embodied foundation models. D-VLA introduces "Plane Decoupling," physically isolating high-frequency training data from low-frequency weight control to eliminate interference between simulation and optimization. We further design a four-thread asynchronous "Swimlane" pipeline, enabling full parallel overlap of sampling, inference, gradient computation, and parameter distribution. Additionally, a dual-pool VRAM management model and topology-aware replication resolve memory fragmentation and optimize communication efficiency. Experiments on benchmarks like LIBERO show that D-VLA significantly outperforms mainstream RL frameworks in throughput and sampling efficiency for billion-parameter VLA models. In trillion-parameter scalability tests, our framework maintains exceptional stability and linear speedup, providing a robust system for high-performance general-purpose embodied agents.
♻ ☆ Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning ICLR 2026
Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these ``natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to ``variables'' used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
comment: Published as a conference paper at ICLR 2026
♻ ☆ Scaling few-shot spoken word classification with generative meta-continual learning
Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained less than half of the data for two orders of magnitude less time.
♻ ☆ AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models ACL 2026
The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.
comment: Findings of ACL 2026
♻ ☆ Does language matter for spoken word classification? A multilingual generative meta-learning approach
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in application, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences between model performance is unexpectedly low. We also find that the hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.
♻ ☆ TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA ICLR 2026
Large Language Models (LLMs) are widely applied in real world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of +4~10% compared to baselines overall.
comment: ICLR 2026
♻ ☆ ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales-particularly the industrial-grade Qwen3-235B-demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.
♻ ☆ Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization
Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, constraints' achievement can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these, we novelly formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) issue that needs to jointly optimize multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on \emph{Chinese Baby Naming}, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Relevant data and codes are available at https://github.com/foolfun/MAGIC_HMO.
comment: 19 pages,10 figures. Submitted to ACM for possible publication
♻ ☆ Distributions as Actions: A Unified Framework for Diverse Action Spaces ICLR 2026
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
comment: Accepted to ICLR 2026 (camera-ready)
♻ ☆ FlowSteer: Towards Agents Designing Agentic Workflows via Reinforced Progressive Canvas Editing
In recent years, agentic workflows have been widely applied to solve complex human tasks. However, existing workflow construction still faces key challenges, including human-dependent workflow construction, the lack of graph-level execution feedback, and the inability to repair errors in-loop during long-horizon construction. To address these challenges, we propose FlowSteer, a new paradigm of Agent Designing Agentic Workflows - a single agent itself end-to-end designs the workflow that a downstream executor runs. To support this paradigm, we introduce the Workflow Canvas, a novel executable graph-state environment that returns syntax-checked execution feedback for every atomic edit. Built on the canvas, we further propose Reinforced Progressive Canvas Editing, in which a lightweight policy agent issues one atomic edit per turn conditioned on real canvas feedback, and is trained end-to-end via reinforcement learning. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks. Our code is available at https://anonymous.4open.science/r/FlowSteer-9B2E.
comment: 51 pages, 6 figures, 5 tables. Project page: http://flowsteer.org/
♻ ☆ OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).
♻ ☆ E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory ICML 2026
The evolution of Large Language Model (LLM) agents towards System~2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54\% F1, surpassing the state-of-the-art GAM by 7.75\%, while reducing token cost by over 70\%.
comment: This paper has been accepted by ICML 2026. If you find our project helpful, please consider giving it a star: https://github.com/dog-last/E-mem
♻ ☆ LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning AAAI 2026
Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches. The code is available at https://github.com/chanceche/LiLoRA.
comment: AAAI 2026 Oral Presentation. 9 pages
♻ ☆ Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry ICLR 2026
Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic complexity in the group size. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
comment: 33 pages, 2 figures. Published at ICLR 2026
♻ ☆ LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving
Proving theorems in Lean 4 often requires identifying a scattered set of library lemmas whose joint use enables a concise proof -- a task we call global premise retrieval. Existing tools address adjacent problems: semantic search engines find individual declarations matching a query, while premise-selection systems predict useful lemmas one tactic step at a time. Neither recovers the full premise set an entire theorem requires. We present LeanSearch v2, a two-mode retrieval system for this task. Its standard mode applies a hierarchy-informalized Mathlib corpus with an embedding-reranker pipeline, achieving state-of-the-art single-query retrieval without domain-specific fine-tuning (nDCG@10 of 0.62 vs. 0.53 for the next-best system). Its reasoning mode builds on standard mode as its retrieval substrate, targeting global premise retrieval through iterative sketch-retrieve-reflect cycles. On a 69-query benchmark of research-level Mathlib theorems, reasoning mode recovers 46.1% of ground-truth premise groups within 10 retrieved candidates, outperforming strong reasoning retrieval systems (38.0%) and premise-selection baselines (9.3%) on the same benchmark. In a controlled downstream evaluation with a fixed prover loop, replacing alternative retrievers with LeanSearch v2 yields the highest proof success (20% vs. 16% for the next-best system and 4% without retrieval), confirming that retrieval quality propagates to proof generation. We have open-sourced all code, data, and benchmarks. Code and data: https://github.com/frenzymath/LeanSearch-v2 . The standard mode is publicly available with API access at https://leansearch.net/ .
♻ ☆ (How) Do Large Language Models Understand High-Level Message Sequence Charts?
Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.
♻ ☆ Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}^{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
comment: 24 pages, 24 figures
♻ ☆ Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming
Adaptive video streaming optimizes Quality of Experience (QoE) metrics by selecting appropriate bitrates according to varying network bandwidth and user demands. In practice, however, real-world network bandwidth often exhibits heterogeneity relative to training environments. Current methods predominantly tackle this problem through learning-based approaches designed to improve generalization performance. While our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to heterogeneous network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose the Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. Moreover, we establish a tighter performance bound for ReSiN under non-stationary network conditions. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better quality of experience (QoE) while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms in stationary environments, demonstrating its robust adaptability across different network conditions.
♻ ☆ The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma
Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves modality-specific radiomic structure while enabling late fusion in a compact probabilistic latent space. The approach is evaluated on radiomic features extracted from the necrotic tumor core in post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Re-covery (FLAIR) Magnetic Resonance Imaging (MRI). Experimental results demonstrate that the proposed multi-view VAE combined with a random forest classifier achieves a test Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.77 (95% confidence interval: 0.71-0.83), substantially outperforming both a baseline radiomics model (AUC = 0.54) and a hyperparameter-tuned model (AUC = 0.64). These findings indicate that multi-view probabilistic encoding enables more effective integration of complementary MRI information and significantly improves predictive performance for MGMT promoter methylation status.
comment: 17 pages, 4 figures
♻ ☆ Quantifying and Mitigating Self-Preference Bias of LLM Judges
LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5\% on average.
♻ ☆ Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models ACL2026
Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
comment: Accepted by ACL2026
♻ ☆ Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
comment: 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control
♻ ☆ UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. Therefore we introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with FFT-Laplace Transform and TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both forecasting accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.
comment: The authors wish to withdraw this preprint due to a lack of consensus regarding the final authorship list and the order of authors
♻ ☆ Graph of States: Solving Abductive Tasks with Large Language Models
Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in a structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://github.com/gaorch85/Graph-of-States.
♻ ☆ Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content level signals. BOT-MOD identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that BOT-MOD reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
♻ ☆ Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
♻ ☆ Krause Synchronization Transformers ICML 2026
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Empirically, we validate Krause Attention across diverse settings, including vision (ViT on CIFAR/ImageNet), autoregressive image generation (MNIST/CIFAR-10), large language models (Llama/Qwen), and language models trained from scratch at multiple scales (100M/200M). Across these domains, Krause Attention achieves consistent performance gains while improving computational efficiency, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
comment: ICML 2026, Project page: https://jingkun-liu.github.io/krause-sync-transformers/
♻ ☆ Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
♻ ☆ Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88x speedup over PyTorch Eager (1.85x over torch.compile) on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch. Code is publicly available at: https://github.com/pike-project/pike
♻ ☆ WriteSAE: Sparse Autoencoders for Recurrent State
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.
comment: 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae
♻ ☆ CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training.
♻ ☆ Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
♻ ☆ ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
Graph-based agent memory is increasingly used in LLM agents to support structured long-term recall and multi-hop reasoning, but it also creates a new poisoning surface: an attacker can inject a crafted relation into graph memory so that it is later retrieved and influences agent behavior. Existing agent-memory poisoning attacks mainly target flat textual records and are ineffective in graph-based memory because malicious relations often fail to be extracted, merged into the target anchor neighborhood, or retrieved for the victim query. We present SHADOWMERGE, a poisoning attack against graph-based agent memory that exploits relation-channel conflicts. Its key insight is that a poisoned relation can share the same query-activated anchor and canonicalized relation channel as benign evidence while carrying a conflicting value. To realize this, we design AIR, a pipeline that converts the conflict into an ordinary interaction that can be extracted, merged, and retrieved by the graph-memory system. We evaluate SHADOWMERGE on Mem0 and three public real-world datasets: PubMedQA, WebShop, and ToolEmu. SHADOWMERGE achieves 93.8% average attack success rate, improving the best baseline by 50.3 absolute points, while having negligible impact on unrelated benign tasks. Mechanism studies show that SHADOWMERGE overcomes the three key limitations of existing agent-memory poisoning attacks, and defense analysis shows that representative input-side defenses are insufficient to mitigate it. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE.
comment: Preprint. Corresponding authors: Zifeng Kang and Tiantian Ji. Code is available at https://anonymous.4open.science/status/ShadowMerge-033C
♻ ☆ A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection
Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.
Computation and Language 149
☆ ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
comment: Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS
☆ FutureSim: Replaying World Events to Evaluate Adaptive Agents
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
comment: 31 pages, 10 main
☆ Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
☆ MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs
Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.
☆ Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
comment: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)
☆ MeMo: Memory as a Model
Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.
comment: This paper introduces MeMo, a framework that augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise
☆ Self-Distilled Agentic Reinforcement Learning
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment for negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL--OPSD baselines across model scales.
☆ Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
☆ MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
comment: 46 pages, 15 figures
☆ Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times$6 Target $\times$ Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy -- 401 data-populated and 106 threat-model-derived leaves -- of inference-time attacks extracted from 932 arXiv security studies (2023--2026). The matrix enables benchmark-external validation -- auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25\% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46$\times$ token amplification and 96\% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety \& Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
☆ Proposal and study of statistical features for string similarity computation and classification
Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
☆ From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
☆ Improving Multi-turn Dialogue Consistency with Self-Recall Thinking
Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: Generating and converting it into self-recall chains; (2)Capability Initialization: Training to enable reasoning chains with recall tokens capability; (3)Reasoning Improvement: Refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.
☆ ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World ICML 2026
The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
comment: Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223
☆ Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
☆ On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
comment: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
☆ Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use
Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.
☆ Orchard: An Open-Source Agentic Modeling Framework
Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
☆ AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models
Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts -- a question with direct implications for AI governance and auditing. Drawing on Habermas's (1981) Theory of Communicative Action, Goffman's (1959) dramaturgical model, Bell's (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation -- from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Delta+24.9%, Delta+24.2%) and the automated AI monitoring condition (Delta+22.2%) produce higher TTR change than audience-framing conditions (Delta+17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition -- replacing human with AI observers -- yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
comment: 20 pages, 6 figures
☆ From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG
Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.
☆ COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
☆ Small, Private Language Models as Teammates for Educational Assessment Design
Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom's taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom's taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.
☆ Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.
comment: 25 pages, 11 figures
☆ The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale
Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
comment: 8 pages, 4 figures
☆ Quantifying and Mitigating Premature Closure in Frontier LLMs
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
comment: 14 pages, 3 figures, 1 table
☆ Explainable Detection of Depression Status Shifts from User Digital Traces
Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
☆ Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
☆ Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP's effectiveness, achieving up to 13% absolute improvement over standard baselines.
☆ Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Foundation models tokenize Ukrainian legal text with vastly different efficiency, yet no systematic comparison exists for this domain. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Three findings emerge. (1) Tokenizer fertility varies 1.6x: Qwen3 models consume 60% more tokens than Llama-family models on identical input, directly reducing API cost. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (675B total, 41B active) -- a model with 5.6x more total parameters and 3.4x more active parameters per token -- at one-third the API cost. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages.
comment: 22 pages, 21 tables, 3 figures
☆ Holistic Evaluation and Failure Diagnosis of AI Agents
AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
☆ Conversion of Lexicon-Grammar tables to LMF. Application to French
We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.
☆ Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.
☆ Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
☆ Streaming Speech-to-Text Translation with a SpeechLLM
Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
comment: 9 pages of main text; 24 pages in total
☆ Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance by utilizing subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that more align with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.
comment: 9 pages, 2 figures, 3 tables
☆ Non-linear Interventions on Large Language Models
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
☆ Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining ICML 2026
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
comment: Accepted at ICML 2026
☆ Mechanical Enforcement for LLM Governance:Evidence of Governance-Task Decoupling in Financial Decision Systems
Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal--agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level -- where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model's interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC~$0.43$ to $0.88$. The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance -- the gain comes from removing clear-cut decisions from the model's control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
☆ Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
☆ IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines
comment: Code can be found in https://github.com/ZGC-EmbodyAI/IntentVLA
☆ AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents
Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0--100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4\%), versus Gemini-Simple (69.1\%) and DeepL (64.4\%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
☆ Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI
Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.
comment: 20 pages, 8 figures, 4 tables
☆ Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
☆ SciPaths: Forecasting Pathways to Scientific Discovery
Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.
☆ EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text--a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
☆ Uncertainty Quantification for Large Language Diffusion Models
Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.
☆ Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
comment: 30 pages, 12 figures and tables, 58 references. Under review at Software Quality Journal (Springer). Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359
☆ Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.
☆ Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
comment: Preprint
☆ Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
comment: Work on progress
☆ Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.
☆ Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.
comment: Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation
☆ GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
☆ Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ
Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.
comment: 47 pages; 1 figure; 40 tables; SLE2019; under review
☆ When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context
Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
comment: 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)
☆ Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross- model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families: CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus, but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake- injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.
comment: 12 pages, 4 figures, 3 tables
☆ LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
comment: 27 pages, 3 figures
☆ When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Hallucination detection in large language models (LLMs) requires balancing accu racy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are effi cient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the or thogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
☆ Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.
comment: 30 pages, preprint
☆ A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it on a standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
comment: 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024
☆ SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.
☆ Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.
☆ Agentic Recommender System with Hierarchical Belief-State Memory
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that \ours achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.
comment: 4 figures, 8 tables
☆ Nexus : An Agentic Framework for Time Series Forecasting
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware to real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.
comment: 30 Pages, 3 figures, 5 Tables
☆ NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).
☆ Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.
☆ Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
☆ Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax ACL 2026
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
comment: ACL 2026 Findings
☆ A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring
Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.
comment: Submitted to ACM IMWUT
☆ Herculean: An Agentic Benchmark for Financial Intelligence
As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.
☆ LLM-based Detection of Manipulative Political Narratives
We present a new computational framework for detecting and structuring manipulative political narratives. A task that became more important due to the shift of political discussions to social media. One of the primary challenges thereby is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. This approach, applied to over 1.2 million social media posts, effectively identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering.
comment: This paper has been submitted to the upcoming 18th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2026)
☆ Ideology Prediction of German Political Texts AAAI
Elections represent a crucial milestone in a nation's ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Such a task can only be achieved with multiclass classifiers, provided that the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing, respectively. For in-domain performance, DeBERTa-large achieved the highest F1 score F1=0.844 as well as for the X (Twitter) out-of-domain test ACC=0.864. Regarding the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.
comment: This paper has been accepted for the upcoming 20th International AAAI Conference on Web and Social Media (ICWSM 2026)
☆ Dynamic Latent Routing
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
☆ Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of $3.86\times$.
☆ Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ= 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
comment: 12 pages, 2 figures, 3 tables. Code and data: https://github.com/libophd/minimal-kv-retention
☆ To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
☆ Web Agents Should Adopt the Plan-Then-Execute Paradigm
ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
☆ MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification ICML 2026
Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.
comment: Accepted by ICML 2026
☆ Auditing Agent Harness Safety
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.
comment: 11 Pages, 8 Figures
☆ Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.
☆ What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .
comment: To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
☆ Active Learners as Efficient PRP Rerankers
Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.
comment: 13 pages, 7 figures. Preprint
☆ DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System
Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.
comment: Work in Progress
☆ Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
☆ PreFT: Prefill-only finetuning for efficient inference
Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
♻ ☆ Leveraging Speech to Identify Signatures of Insight and Transfer in Problem Solving
Many problems seem to require a flash of insight to solve. What form do these sudden insights take, and what impact do they have on how people approach similar problems in the future? In this work, we prompted participants (N = 189) to think aloud as they attempted to solve a sequence of five "matchstick-arithmetic" problems. These problems either all relied on the same kind of non-obvious solution (Same group) or a different kind each time (Different group). We found that Same participants improved more rapidly than Different participants, and as they improved, they talked more and talked about different things when solving later problems. Specifically, they were more likely to spontaneously categorize the problem they were working on. Taken together, these findings suggest that a hallmark of transferable insights is their accessibility for verbal report, even if the underlying precursors of insight remain difficult to articulate.
♻ ☆ Proxy Compression for Language Modeling ICML 2026
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.
comment: ICML 2026
♻ ☆ MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Large Language Models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide current model' test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by by over 45% on average while improving accuracy from 0.33 to 3.46%.
♻ ☆ Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis
In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an `ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.
comment: Accepted FAccT 2026; 29 pages, 14 tables, 7 figures. Previous title - Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments; NLLPW 2026
♻ ☆ TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
♻ ☆ Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
As foundation AI models continue to increase in size, an important question arises - is massive scale the only path forward? This survey of about 160 papers presents a family of Small Language Models (SLMs) in the 1 to 8 billion parameter range that demonstrate smaller models can perform as well, or even outperform large models. We explore task agnostic, general purpose SLMs, task-specific SLMs and techniques to create SLMs that can guide the community to build models while balancing performance, efficiency, scalability and cost. Furthermore we define and characterize SLMs' effective sizes, representing increased capability with respect to LLMs.
♻ ☆ Residual Stream Duality in Modern Transformer Architectures
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
comment: Project Page: https://github.com/yifanzhang-pro/residual-stream-duality
♻ ☆ ClawGym: A Scalable Framework for Building Effective Claw Agents
Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
♻ ☆ fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
comment: Code are available: https://github.com/yuxiangwei0808/fMRI-LM
♻ ☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
♻ ☆ Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks
Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs) with 7B to 14B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. Dataset and associated code available at https://github.com/StonyBrookNLP/PolyBench.
♻ ☆ TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
♻ ☆ Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
♻ ☆ MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
As LLM-powered agents are increasingly deployed in edge-cloud environments, personalized memory has become a key enabler of long-term adaptation and user-centric interaction. However, cloud-assisted memory management exposes sensitive user information, while existing privacy protection methods typically rely on aggressive masking that removes task-relevant semantics and consequently degrades memory utility and personalization quality. To address this challenge, We propose MemPrivacy, which identifies privacy-sensitive spans on edge devices, replaces them with semantically structured type-aware placeholders for cloud-side memory processing, and restores the original values locally when needed. By decoupling privacy protection from semantic destruction, MemPrivacy minimizes sensitive data exposure while retaining the information required for effective memory formation and retrieval. We also construct MemPrivacy-Bench for systematic evaluation, a dataset covering 200 users and over 155k privacy instances, and introduce a four-level privacy taxonomy for configurable protection policies. Experiments show that MemPrivacy achieves strong performance in privacy information extraction, substantially surpassing strong general-purpose models such as GPT-5.2 and Gemini-3.1-Pro, while also reducing inference latency. Across multiple widely used memory systems, MemPrivacy limits utility loss to within 1.6%, outperforming baseline masking strategies. Overall, MemPrivacy offers an effective balance between privacy protection and personalized memory utility for edge-cloud agents, enabling secure, practical, and user-transparent deployment.
♻ ☆ Argument Reconstruction as Supervision for Critical Thinking in LLMs
To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument's underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset.
♻ ☆ The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
comment: 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap
♻ ☆ LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
♻ ☆ Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
♻ ☆ Comparing Developer and LLM Biases in Code Evaluation
As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.
♻ ☆ Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across- and within-personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
comment: Added experiments with a logit-based method and now reporting unbounded metrics
♻ ☆ What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network
Moltbook is the first large-scale social network built for autonomous AI agent-to-agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent's input and showed that output is mainly determined by agent identity files, behavioural instructions, and context-window structure. We formalised these findings in the Architecture-Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent's context-window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short-horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.
comment: 56 pages
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.
comment: 30 pages, 16 figures
♻ ☆ Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.
♻ ☆ Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUET (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
♻ ☆ Context Training with Active Information Seeking
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
comment: Preprint
♻ ☆ Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information
In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that uses multiple prompts with large language models to provide contextually relevant answers to natural-language queries, including complex multi-hop questions requiring reasoning across interconnected entities.
♻ ☆ Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from ``cold-start'' format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.
♻ ☆ DMAP: A Distribution Map for Text ICLR 2026
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
comment: ICLR 2026
♻ ☆ CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
♻ ☆ Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that are not captured by coarse-grained benchmarking, highlighting the importance of hardware-aware evaluation for long-context inference.
comment: 9 pages, 5 figures, 6 tables
♻ ☆ Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge
In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics. AF-Retriever's average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares for the reranking on a configurable amount of suitable candidates that fulfill the given constraints the most, and (4) a hybrid retrieval strategy that reduces error susceptibility. In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever .
♻ ☆ Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.
comment: 14 pages, 7 figures
♻ ☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
♻ ☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
♻ ☆ Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
comment: The paper has mistake of undertaking political spaces to semantic dimensions. This needs to be removed because this is a fetal flaw in consideration. The initial hypothesis and premise needs to be rigorously formulated within the political landscape not generalizing the metrics. Hence a withdrawal for now is necessary
♻ ☆ Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents ICML
Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground-truth, but annotating all valid ground-truth trajectories is prohibitively expensive. In this manner, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank which accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.
comment: International Conference on Machine Learning (ICML) 2026
♻ ☆ Near-Miss: Latent Policy Failure Detection in Agentic Workflows ACL2026
Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversations traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether agent's tool-calling decisions where sufficiently informed. We evaluate our approach on the $τ^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
comment: GEM@ACL2026, 13 pages
♻ ☆ Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis
Aspect-Based Sentiment Analysis (ABSA) faces critical challenges due to representation entanglement and false-negative collisions in real-valued embedding spaces. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. Our approach projects textual features into a complex semantic space, utilizing the phase to isolate sentiment polarities while regularizing the amplitude to ensure structural consistency within aspect categories. To mitigate this, we introduce an anti-collision mask that preserves intra-polarity aspect cohesion while significantly expanding the discriminative margin between opposing polarities. Experimental results on the ASAP dataset demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8923, outperforming robust baselines.
♻ ☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
♻ ☆ OneSearch-V2: The Latent Reasoning Enhanced Self-distillation Generative Search Framework
Generative Retrieval (GR) has emerged as a promising paradigm for modern search systems. Compared to multi-stage cascaded architecture, it offers advantages such as end-to-end joint optimization and high computational efficiency. OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits. However, its inadequate understanding of complex queries, inefficient exploitation of latent user intents, and overfitting to narrow historical preferences have limited its further performance improvement. To address these challenges, we propose OneSearch-V2, a latent reasoning enhanced self-distillation generative search framework. It contains three key innovations: (1) a thought-augmented complex query understanding module, which enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference; (2) a reasoning-internalized self-distillation training pipeline, which uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning; (3) a behavior preference alignment optimization system, which mitigates reward hacking arising from the single conversion metric, and addresses personal preference via direct user feedback. Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities. Online A/B tests further validate its business effectiveness, yielding +3.98\% item CTR, +2.07\% buyer volume, and +2.11\% order volume. Manual evaluation further confirms gains in search experience quality, with +1.37\% in page good rate and +1.65\% in query-item relevance. More importantly, OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
comment: Codes are available at https://github.com/benchen4395/onesearch-family. Feel free to contact benchen4395@gmail.com
♻ ☆ Do Language Models Encode Knowledge of Linguistic Constraint Violations?
Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
♻ ☆ FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding
Large language models hallucinate factual claims and struggle to ground their outputs in retrievable evidence, particularly in non-English languages. Existing resources impose a trade-off: structured knowledge bases lack textual grounding, whereas grounded datasets remain small and monolingual. We introduce FactNet, a billion-scale open resource that couples 1.7B Wikidata assertions with 3.01B evidence pointers drawn from 316 native Wikipedia editions. FactNet employs a deterministic construction pipeline, ensuring that every evidence unit is traceable to its source with byte-level precision. We further establish FactNet-Bench, an evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking, equipped with systematic leakage controls. Experiments demonstrate that FactNet-Bench differentiates among structural, text-aware, and LLM-integrated methods, and that cross-lingual structure enables knowledge transfer across language tiers.
♻ ☆ Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning ICLR 2026
Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these ``natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to ``variables'' used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
comment: Published as a conference paper at ICLR 2026
♻ ☆ DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29\%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.
comment: 50pages, 25 figures, 14 tables;
♻ ☆ Scaling few-shot spoken word classification with generative meta-continual learning
Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained less than half of the data for two orders of magnitude less time.
♻ ☆ Does language matter for spoken word classification? A multilingual generative meta-learning approach
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in application, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences between model performance is unexpectedly low. We also find that the hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.
♻ ☆ MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10--20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
♻ ☆ TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA ICLR 2026
Large Language Models (LLMs) are widely applied in real world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of +4~10% compared to baselines overall.
comment: ICLR 2026
♻ ☆ Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization
Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, constraints' achievement can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these, we novelly formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) issue that needs to jointly optimize multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on \emph{Chinese Baby Naming}, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Relevant data and codes are available at https://github.com/foolfun/MAGIC_HMO.
comment: 19 pages,10 figures. Submitted to ACM for possible publication
♻ ☆ OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).
♻ ☆ Energy-Regularized Sequential Model Editing on Hyperspheres ICLR 2026
Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable, retain prior knowledge, while still accommodate new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveals a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
comment: Accepted by ICLR 2026. The code is available at https://github.com/PlusLabNLP/SPHERE. Project page: https://www.qingyuanliu.net/sphere_projectpage/
♻ ☆ Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models ICML 2026
Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance-defined over token-embedding geometry-to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.
comment: 20 pages, 3 figures, 8 tables, ICML 2026
♻ ☆ It Takes Two: Your GRPO Is Secretly DPO
GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating the \emph{value baselines} from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains $97.6\%$ of the performance of 16-GRPO, while requiring only $12.5\%$ of the rollouts and $21\%$ of the training time.
♻ ☆ Quantifying and Mitigating Self-Preference Bias of LLM Judges
LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5\% on average.
♻ ☆ Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models ACL2026
Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
comment: Accepted by ACL2026
♻ ☆ Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
comment: 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control
♻ ☆ WriteSAE: Sparse Autoencoders for Recurrent State
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.
comment: 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae
♻ ☆ Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases ACL 2026
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
comment: Accepted by ACL 2026 Workshop EvalEval
♻ ☆ Do Reasoning LLMs Refuse What They Infer in Long Contexts?
Long-context LLMs can infer objectives that are not stated explicitly. This capability is useful for reasoning over documents, code, retrieved evidence, and tool traces, but it also creates a safety risk: harmful intent can be distributed across a context and become visible only after the model composes the relevant pieces. Existing safety evaluations mostly test explicit harmful requests, and therefore miss this failure mode. We introduce compositional reasoning attacks, a long-context threat model in which harmful requests are decomposed into semantically incomplete fragments and embedded in long contexts. The final query is neutral; the harmful objective emerges only if the model retrieves the fragments, composes them, and infers the implied goal. We instantiate this setting using AdvBench requests, varying the required reasoning from Direct Retrieval to Single-hop Aggregation, Chain Reasoning, and Multi-hop Deductive Reasoning, and evaluate 15 frontier LLMs on contexts up to 64k tokens. Models usually refuse harmful requests when they are directly retrievable. However, refusal rates drop sharply when the same objectives must be reconstructed compositionally, often with larger failures in longer contexts. Benign reconstruction and fragment-position analyses indicate that these failures are not mainly retrieval errors: models often infer the harmful objective and then comply. Increasing inference-time reasoning improves refusal but remains incomplete and costly. Our results reveal a long-context safety gap: current models are better at refusing harmful requests they see than harmful objectives they infer.
comment: 33 pages, 6 figures
♻ ☆ Query-Conditioned Test-Time Self-Training for Large Language Models
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem--solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs. Code is available at https://chssong.github.io/Query-Conditioned-TTST/.
comment: 17 pages, 7 figures
♻ ☆ TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task and model-dependent. In this paper, we precisely address this and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.
comment: Updated and reorganized results. Added new results
♻ ☆ CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated concept sets and gold-standard concept sets. Results CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing was inexpensive and stable across runs. Conclusions CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.
comment: 6 figures, 4 tables
♻ ☆ Exact Coset Sampling for Quantum Lattice Algorithms
We revisit the post-processing phase of Chen's Karst-wave quantum lattice algorithm (Chen, 2024) in the Learning with Errors (LWE) parameter regime. Conditioned on a transcript $E$, the post-Step 7 coordinate state on $(\mathbb{Z}_M)^n$ is supported on an affine grid line $\{\, jΔ+ v^{\ast}(E) + M_2 k \bmod M : j \in \mathbb{Z},\ k \in \mathcal{K} \,\}$, with $Δ= 2D^2 b$, $M = 2M_2 = 2D^2 Q$, and $Q$ odd. The amplitudes include a quadratic Karst-wave chirp $\exp(-2πi j^2 / Q)$ and an unknown run-dependent offset $v^{\ast}(E)$. We show that Chen's Steps 8-9 can be replaced by a single exact post-processing routine: measure the deterministic residue $τ:= X_1 \bmod D^2$, obtain the run-local class $v_{1,Q} := v_1^{\ast}(E) \bmod Q$ as explicit side information in our access model, apply a $v_{1,Q}$-dependent diagonal quadratic phase on $X_1$ to cancel the chirp, and then apply $\mathrm{QFT}_{\mathbb{Z}_M}^{\otimes n}$ to the coordinate registers. The routine never needs the full offset $v^{\ast}(E)$. Under Additional Conditions AC1-AC5 on the front end, a measured Fourier outcome $u \in \mathbb{Z}_M^n$ satisfies the resonance $\langle b, u \rangle \equiv 0 \pmod Q$ with probability $1 - o(1)$. Moreover, conditioned on resonance, the reduced outcome $u \bmod Q$ is exactly uniform on the dual hyperplane $H = \{\, v \in \mathbb{Z}_Q^n : \langle b, v \rangle \equiv 0 \pmod Q \,\}$.
comment: Preprint - Work in Progress
♻ ☆ On the Diagram of Thought
Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This process is controller-light: it does not require an external search algorithm or planner, but it does use a deterministic online validator for grammar-constrained typed traces, register constraints, and optional solver checks. To clarify the reliability target of this process, we ground DoT in a mathematical framework from category theory. We interpret accepted typed reasoning records as diagrams in a slice topos and model synthesis of the selected proposer subdiagram as a finite limit. In the predicate fragment, this same object is equivalently a variance-reversed colimit in the opposite information order. The resulting formalism gives an auditable, step-by-step trace of the LLM's typed reasoning and separates semantic guarantees for the typed subtrace from unconstrained natural-language text and uncertified operational edges.
comment: 30 pages
Machine Learning 300
☆ RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
☆ FutureSim: Replaying World Events to Evaluate Adaptive Agents
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
comment: 31 pages, 10 main
☆ When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.
comment: 22 pages, 8 figures. Code: https://github.com/tdooms/tensor-similarity
☆ Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing
Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
comment: 5 pages, 4 figures
☆ Evidential Reasoning Advances Interpretable Real-World Disease Screening ICML 2026
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
comment: ICML 2026
☆ Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
comment: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)
☆ Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands
This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.
☆ Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction
Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human takeover data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the takeover moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with direct teleoperation takeover, HandITL reduces takeover jitter by 99.8% and preserves robust post-takeover manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect intervention data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.
☆ MeMo: Memory as a Model
Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.
comment: This paper introduces MeMo, a framework that augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise
☆ Self-Distilled Agentic Reinforcement Learning
Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment for negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL--OPSD baselines across model scales.
☆ RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution
Feature attribution analysis is critical for interpreting machine learning models and supporting reliable data-driven decisions. However, feature attribution measures often exhibit stochastic variation: different train--test splits, random seeds, or model-fitting procedures can produce substantially different attribution values and feature rankings. This paper proposes a framework for incorporating stochastic nature of feature attribution and a robust attribution metric, RoSHAP, for stable feature ranking based on the SHAP metric. The proposed framework models the distribution of feature attribution scores and estimates it through bootstrap resampling and kernel density estimation. We show that, under mild regularity conditions, the aggregated feature attribution score is asymptotically Gaussian, which greatly reduces the computational cost of distribution estimation. The RoSHAP summarizes the distribution of SHAP into a robust feature-ranking criterion that simultaneously rewards features that are active, strong, and stable. Through simulations and real-data experiments, the proposed framework and RoSHAP outperform standard single-run attribution measures in identifying signal features. In addition, models built using RoSHAP-selected features achieve predictive performance comparable to full-feature models while using substantially fewer predictors. The proposed RoSHAP approach improves the stability and interpretability of machine learning models, enabling reliable and consistent insights for analysis.
☆ Widening the Gap: Exploiting LLM Quantization via Outlier Injection
LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users. However, existing quantization-conditioned attacks have been limited to relatively simple quantization methods, where the attacker can estimate weight regions that remain invariant under the target quantization. Notably, prior attacks have consistently failed to compromise more popular and sophisticated schemes, limiting their practical impact. In this work, we introduce the first quantization-conditioned attack that consistently induces malicious behavior that can be triggered by a broad range of advanced quantization techniques, including AWQ, GPTQ, and GGUF I-quants. Our attack exploits a simple property shared by many modern quantization methods: large outliers can cause other weights to be rounded to zero. Consequently, by injecting outliers into specific weight blocks, an adversary can therefore induce a targeted, predictable weight collapse in the model. This effect can be used to craft seemingly benign full-precision models that exhibit a wide range of malicious behaviors after quantization. Through extensive evaluation across three attack scenarios and LLMs, we show that our attack achieves high success rates against a broad range of quantization methods on which prior attacks fail. Our results demonstrate, for the first time, that the security risks of quantization are not restricted to simpler schemes but are broadly relevant across complex, widely-used quantization methods.
☆ Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
☆ Training ML Models with Predictable Failures
Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator's forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
comment: 32 pages, 9 figures
☆ Causal Foundation Models with Continuous Treatments
Causal inference, estimating causal effects from observational data, is a fundamental tool in many disciplines. Of particular importance across a variety of domains is the continuous treatment setting, where the variable of intervention has a continuous range. This setting is far less explored and represents a substantial shift from the binary treatment setting, with models needing to represent effects across a continuum of treatment values. In this paper, we present the first causal foundation model for the continuous treatment setting. Our model meta-learns the ability to predict causal effects across a wide variety of unseen tasks without additional training or fine-tuning. First, we design a novel prior over data-generating processes with continuous treatment variables in order to generate a rich causal training corpus. We then train a transformer to reconstruct individual treatment-response curves given only observational data, leveraging in-context learning to amortize expensive Bayesian posterior inference. Our model achieves state-of-the-art performance on individual treatment-response curve reconstruction tasks compared to causal models which are trained specifically for those tasks.
comment: 22 pages, 9 figures
☆ Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models
Reactive synthesis, the problem of automatically constructing a hardware circuit from a logical specification, is a long-standing challenge in formal verification. It is elusive for two reasons: It is algorithmically hard, and writing formal specifications by hand is notoriously difficult. In this paper, we tackle both sides of the problem. For the algorithmic side, we present a neuro-symbolic approach to reactive synthesis that couples large reasoning models with model checkers to iteratively repair a synthesized Verilog implementation via sound symbolic feedback. Our approach solves more benchmarks than the best dedicated tools in the annual synthesis competition and extends to constructing parameterized systems, a problem known to be undecidable. On the specification side, we introduce an autoformalization step that shifts the specification task from temporal logic to natural language by introducing a hand-authored dataset of natural-language specifications for evaluation. We demonstrate performance comparable to that of starting from formal specifications, establishing natural synthesis as a viable end-to-end workflow.
☆ CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios
Robust state estimation for highly dynamic motion of legged robots remains challenging, especially in dynamic, contact-rich scenarios. Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage. This paper presents CoCo-InEKF, a differentiable invariant extended Kalman filter that utilizes continuous contact velocity covariances instead of binary contact states. These learned covariances allow the method to dynamically modulate contact confidence, accounting for more nuanced conditions ranging from firm contact to directional slippage or no contact. To predict these covariances for a set of predefined contact candidate points, we employ a lightweight neural network trained end-to-end using a state-error loss. This approach eliminates the need for heuristic ground-truth contact labels. In addition, we propose an automated contact candidate selection procedure and demonstrate that our method is insensitive to their exact placement. Experiments on a bipedal robot demonstrate a superior accuracy-efficiency tradeoff for linear velocity estimation, as well as improved filter consistency compared to baseline methods. This enables the robust execution of challenging motions, including dancing and complex ground interactions -- both in simulation and in the real world.
comment: RSS 2026
☆ Learning from Language Feedback via Variational Policy Distillation
Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
☆ Proposal and study of statistical features for string similarity computation and classification
Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
☆ Logging Policy Design for Off-Policy Evaluation
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
☆ From Data to Action: Accelerating Refinery Optimization with AI
Nowadays refinery optimization utilizes sheer amounts of data, which can be handled with modern Linear Programming (LP) software, but the interpreting and applying the results remains challenging. Large petrochemical companies use massive models, with hundreds of thousands of input matrix elements. The LP solution is mathematically correct, but simplifications are made in the model, and data supply errors may occur. Therefore, further insight is needed to trust the results. The LP solver does not have a memory, so additional understanding could be gained by analyzing historical data and comparing it to the current plan. As such, machine learning approaches were suggested to support decision making based on the LP solution. Among these, Anomaly Detection tools are proposed to be used in tandem with the LP output. A transformed version of the popular ECOD methodology is applied. New methods are proposed to handle high-dimensional data: choosing the most informative pairs. Then, this is used alongside two 2D Anomaly Detection algorithms, revealing several business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.
comment: 34 pages, 17 figures
☆ Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction
The choice of optimiser is important in deep learning, as it strongly influences model efficiency and speed of convergence. However, many commonly used optimisers encounter difficulties when applied to imbalanced and sequential datasets, limiting their ability to capture patterns of minority classes. In this study, we propose Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically scales the learning rate using a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. DBS-Adam improves training stability and accelerates convergence by increasing updates for difficult batches and reducing them for easier ones. We evaluate DBS-Adam by integrating it with Bi-Directional LSTM networks for accident injury severity prediction, addressing class imbalance through SMOTE-ENN resampling and Focal Loss. Four experimental configurations compare baseline Bi-LSTM models and alternative architectures to assess optimiser impact. Rigorous comparison against state-of-the-art optimisers (AMSGrad, AdamW, AdaBound) across five random seeds demonstrated DBS-Adam's competitive performance with statistically significant precision improvements (p=0.020). Results indicate that DBS-Adam outperforms standard optimisation approaches, achieving 95.22% test accuracy, 96.11% precision, 95.28% recall, 95.39% F1-score, and a test loss of 0.0086. The proposed framework enables effective real-time accident severity classification for targeted emergency response and road safety interventions, demonstrating the value of DBS-Adam for learning from imbalanced sequential data.
☆ Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models
We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree $p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+δ}$ for any $δ\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.
comment: 95 pages, 12 figures
☆ Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets
Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
comment: 23 pages, 5 figures, 11 tables. Project: https://lcp.mit.edu/croissant-baker/ Code: https://github.com/MIT-LCP/croissant-baker
☆ Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
☆ DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
☆ TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.
comment: 65 pages, 10 figures, 40 tables
☆ An Interpretable Latency Model for Speculative Decoding in LLM Serving
Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
comment: 10 pages, 8 figures
☆ Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems
Recently, deep generative models have been used for posterior inference in inverse problems, including high-stakes applications in medical imaging and scientific discovery, where the uncertainty of a prediction can matter as much as the prediction itself. However, posterior uncertainty is difficult to interpret because it can mix ambiguity inherent to the forward operator with uncertainty propagated through inference. We introduce a structural decomposition of posterior uncertainty that isolates intrinsic ambiguity. A cascade formulation makes this ambiguity accessible for calibration analysis, enabling qualitative diagnostics and simulation-based calibration tests that reveal failure modes that remain hidden when models are selected by reconstruction quality alone. We first validate the approach on a Gaussian example with analytical posterior structure, then illustrate the decomposition on accelerated magnetic resonance imaging (MRI), and finally apply the calibration diagnostics to electroencephalography (EEG) source imaging.
☆ SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
☆ TopoPrimer: The Missing Topological Context in Forecasting Models
We introduce TopoPrimer, a framework that makes the global topological structure of the series population an explicit input to any forecasting model. TopoPrimer improves accuracy across diverse domains, stabilizes forecasts under seasonal demand spikes, and closes the cold-start gap. Precomputed once per domain via persistent homology and spectral sheaf coordinates, TopoPrimer deploys per token for fully-trained models and as a lightweight adapter for pre-trained backbones. Of these two components, sheaf coordinates are the primary accuracy driver. Across four public benchmarks on Chronos and TimesFM, TopoPrimer consistently improves forecasting accuracy, with gains of up to 7.3% MSE on ECL. The topology advantage persists with near-identical magnitude across zero-shot and fine-tuned backbones, suggesting topology and per-series training capture complementary signals. The gains are most pronounced in difficult regimes. Under peak seasonal demand, classical and zero-shot models degrade by up to 50%, while TopoPrimer stays within 10%. At cold start with no item history, TopoPrimer reduces MAE by 27% over a topology-free baseline.
comment: 29 pages, 16 figures
☆ Multi-Block Attention for Efficient Channel Estimation in IRS-Assisted mmWave MIMO
Intelligent Reflecting Surfaces (IRSs) are a promising technology for enhancing the spectral and energy efficiency of millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. In these systems, accurate channel estimation remains challenging due to the passive nature of IRS elements and the high pilot overhead in large-scale deployments. This paper presents a deep learning-based Multi-Block Attention (MBA) framework for efficient cascaded channel estimation in IRS-assisted mmWave MIMO systems that utilize orthogonal frequency division multiplexing (OFDM). First, we show the optimality of the discrete Fourier transform (DFT) and Hadamard matrices as phase configurations for least squares (LS) estimation. To reduce training overhead, we selectively deactivate IRS elements and compensate for induced feature loss using a two-stage architecture: (i) a Convolutional Attention Network (CAN) for spatial correlation recovery and (ii) a Complex Multi-Convolutional Network (CMN) for noise suppression. The MBA architecture mitigates error propagation through attention-guided feature refinement and denoising. Simulation results indicate that the MBA method reduces pilot overhead by up to 87% compared to the LS estimator. Additionally, at signal-to-noise ratios of 10 dB, our proposed method achieves approximately 51% lower normalized mean squared error (NMSE) than leading methods. It also maintains low computational complexity and adapts effectively to various propagation environments.
☆ Generalized Priority-Aware Shapley Value
Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV's. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.
☆ Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with magnitudes less SFT data, even matching their performance with full dataset.
comment: 25 pages, 11 figures
☆ DeepTokenEEG Enhancing Mild Cognitive Impairment and Alzheimers Classification via Tokenized EEG Features
The detection of Alzheimers disease (AD) is considered crucial, as timely intervention can improve patient outcomes. Electroencephalogram (EEG)-based diagnosis has been recognized as a non-invasive, accessible, and cost-effective approach for AD detection; however, it faces challenges related to data availability, accuracy of modern deep learning methods, and the time-consuming nature of expert-based interpretation. In this study, a novel lightweight and high-performance model, DeepTokenEEG, was designed for the diagnosis of AD and the classification of EEG signals from AD patients, individuals with other neurological conditions, and healthy subjects. Unlike traditional heavy-weight models, DeepTokenEEG ultilizes spatial and temporal tokenizer that effectively captures AD-related biomarkers in both temporal and frequency domain with only 0.29 million paramaters. Trained in a combined dataset of 274 subjects, including 180 AD cases, and 94 healthy controls, the proposed method achieves a maximum recorded accuracy of 100% on specific frequency bands, representing an improvement of 1.41-15.35% over state-of-the-art methods on the same dataset. These results indicate the potential of DeepTokenEEG for early detection and screening of AD, with promising applicability for deployment due to its compact size.
☆ Explainable Detection of Depression Status Shifts from User Digital Traces
Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
☆ Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition
We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
comment: 9 pages, 2 figures including Appendix with Detailed proofs
☆ Distance-Matrix Wasserstein Statistics for Scalable Gromov--Wasserstein Learning
Gromov--Wasserstein (GW) distances compare graphs, shapes, and point clouds through internal distances, without requiring a common coordinate system. This invariance is powerful, but discrete GW is a nonconvex quadratic optimal transport problem and is difficult to estimate at scale. We propose \emph{Distance-Matrix Wasserstein} (DMW), a hierarchy of Wasserstein statistics comparing laws of random finite distance matrices. Rather than optimizing a global point-level alignment, DMW samples $n$ points from each space, records their pairwise distances, and transports the resulting matrix laws. We prove that DMW is a relaxation and lower bound of GW, and establish a reverse approximation inequality: the GW--DMW gap is controlled by the Wasserstein error of approximating each original measure with $n$ samples. Hence population DMW converges to GW as sampled subspaces become dense. We further give finite-sample bounds, including intrinsic-dimensional rates that depend on the data manifold rather than the ambient matrix dimension $\binom n2$. For scalable computation, we introduce sliced and multi-scale DMW; for $p=1$, the sliced multi-scale dissimilarity yields positive-definite exponential kernels. Experiments on synthetic metric spaces, scalability benchmarks, graph classification, and two-sample testing validate the theory and demonstrate an interpretable GW-style proxy for structural comparison.
☆ InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting
Supervised fine-tuning (SFT) provides the standard approach for teaching LLMs new behaviors from offline expert demonstrations. However, standard SFT uniformly fits all samples -- including those with low likelihood under the base model -- which can disproportionately drive training updates toward overfitting specific samples rather than learning the target behavior. Moreover, adapting to these unlikely samples induces substantial policy shifts that degrade prior capabilities. Existing methods mitigate this by filtering, regenerating, or down-weighting low-likelihood data. In doing so, they often suppress precisely the novel behaviors the base model has yet to learn. We propose InfoSFT, a principled weighting scheme for the SFT objective that concentrates learning signals on maximally informative, medium-confidence tokens -- those neither overly familiar to the base model nor too unlikely to cause instability. Requiring only a one-line modification to the standard token-wise loss, InfoSFT demonstrably improves generalization over vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks with diverse model families, while better preserving pre-existing capabilities.
☆ Efficient Online Conformal Selection with Limited Feedback
We address the problem of conformal selection, where an agent must select a minimal subset of options to ensure that at least one ``success'' is identified with a pre-specified target probability $φ$. While traditional online conformal prediction focuses on maintaining validity for the observed sequence, minimizing the resource cost (efficiency) of such selections, especially under limited feedback, remains a significant challenge. In this work, we consider settings with the most limited ``bandit'' feedback, and demonstrate that the simple Adaptive Conformal Inference (ACI) update rule, when applied to the appropriate control parameter or dual variable, is both adversarially valid, ensuring the success target is met on average for any input sequence (and hence under distribution shifts), and stochastically efficient, achieving sublinear efficiency regret for $i.i.d.$ inputs against an appropriate stochastic benchmark. We show such guarantees under canonical models capturing bandit and semi-bandit feedback to the agent via a unifying algorithmic technique, and analytic framework involving Lyapunov functions. Our approach handles more complex settings than prior work, while requiring significantly less feedback, and our results provide a new theoretical bridge between efficient online learning with limited feedback and distribution-free uncertainty quantification.
☆ nASR: An End-to-End Trainable Neural Layer for Channel-Level EEG Artifact Subspace Reconstruction in Real-Time BCI IEEE
Electroencephalogram (EEG) signals are highly susceptible to artifacts, resulting in a low signal-to-noise ratio which makes extraction of meaningful neural information challenging. Artifact Subspace Reconstruction (ASR) is one of the most widely used artifact filtering techniques in EEG-based BCI applications, owing to its real-time applicability. ASR reconstructs artifact-free signals by operating in Principal Component (PC) space within sliding windows. However, ASR performance is critically sensitive to its threshold parameter - an incorrect threshold risks removing task-relevant neural features alongside artifacts. Furthermore, since PCs are linear combinations of all channels, subspace reconstruction in PC space may alter the underlying data structure, potentially discarding essential neural information. To address these limitations, we propose nASR, a novel end-to-end trainable Keras layer that jointly optimizes artifact rejection and downstream decoding. nASR introduces two trainable threshold parameters: K, which governs artifact detection in PC variance space, and L, which quantifies eigen-spread to pinpoint the primary artifact--contributing channels, enabling selective channel-level reconstruction that preserves clean channel information. An ablation study comprising five model variants (m01 - m05), evaluated across two subjects from the BCI Competition IV Dataset 1, confirms that nASR variants consistently outperform traditional ASR on test classification metrics, while achieving a 6-8x reduction in inference time, making nASR a strong candidate for real-time BCI applications demanding both low latency and high decoding performance.
comment: Preprint. Submitted to IEEE SMC 2026 (under review)
☆ Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication IEEE
Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. Contrary to this, in this paper, a joint semantic-physical layer framework is proposed, which is composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M -QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M -QAM which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-Weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
comment: Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures
☆ Real-time virtual circuits for plasma shape control via neural network emulators
Reliable position and shape control in tokamak plasmas requires accurate real-time regulation of several strongly coupled shape parameters. The control vectors that disentangle these couplings, referred to as \textit{virtual circuits} (VCs), enable independent shape parameter control for a specific Grad--Shafranov (GS) equilibrium. Numerical calculation of VCs is not currently feasible in real time, therefore VCs are usually computed prior to each experiment, using a small number of reference GS equilibria sampled along the desired scenario trajectory, with each VC used to control the plasma within a preset time interval. While effective near the reference equilibrium, this approach can lead to degraded performance as the plasma departs from the reference equilibrium and/or from the desired trajectory, and it complicates the design of robust control strategies for rapidly evolving plasma configurations. In this paper, we construct neural-network-based emulators of plasma shape parameters from which VCs can be derived, to provide the MAST Upgrade (MAST-U) plasma control system with state-aware VCs in real-time. To do this, we develop an extensive library of over a million simulated GS equilibria, covering a substantial portion of the MAST-U operational space. These emulators provide differentiable functions whose gradients can be rapidly computed, enabling the derivation of accurate VCs for real-time shape control. We perform extensive verification of the emulated VCs by testing whether they disentangle the control problem. The neural-network-based approach delivers high accuracy and orthogonality across a diverse range of equilibria. This work establishes the physical validity of emulated VCs as a scalable and general alternative to schedules of precomputed VCs.
☆ Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
☆ Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations
Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.
☆ A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models
Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5--6 bits per weight on hardware with per-layer LUT decode. The methodology combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) is proposed to improve performance, energy, and cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw) at 1.5 bpw lower storage cost, demonstrating that block-scaled small atoms with carefully chosen scale precision can replace conventionally-deployed FP8. Full evaluation across the 4.5--6 bpw range, including layer promotion and sparse residual correction, is reported in a companion paper.
comment: 21 pages
☆ Learning with Shallow Neural Networks on Cluster-Structured Features
The success of deep learning in high-dimensional settings is often attributed to the presence of low-dimensional structure in real-world data. While standard theoretical models typically assume that this structure lies in the target function, projecting unstructured inputs onto a low-dimensional subspace, data such as images, text or genomic sequences exhibit strong spatial correlations within the input space itself. In this paper, we propose a tractable model to study how these correlations affect the sample complexity of learning with gradient descent on shallow neural networks. Specifically, we consider targets that depend on a small number of latent Boolean variables, and input features grouped into clusters and correlated with the latent variables. Under an identifiability assumption, we show that for a layerwise gradient-descent variant, the sample complexity scales with the number of hidden variables and, when the signal-to-noise ratio is sufficiently high, is independent of the input dimension, up to logarithmic terms. We empirically test our theoretical findings on both synthetic and real data.
comment: 10 pages main body, 2 figures
☆ Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse
Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
comment: 18 pages, 4 figures
☆ A Mutual Information Lower Bound for Multimodal Regression Active Learning
Active learning for continuous regression has lacked an acquisition function that targets epistemic uncertainty when the predictive distribution is multimodal: variance misses modal disagreement, and information-theoretic targets like BALD are designed for discrete outputs. We introduce a Two-Index framework that makes this separation explicit: one stochastic index selects among competing model hypotheses (epistemic source), while a second governs within-hypothesis randomness (aleatoric source). An entropy decomposition within the framework identifies the mutual information between the output and the epistemic index as a principled acquisition objective, and we prove this quantity vanishes as the model is trained on growing datasets, confirming that it captures exactly the uncertainty data can resolve. Because this mutual information is intractable for continuous outputs, we derive the Mutual Information Lower Bound (MI-LB) acquisition function, a closed-form approximation for Mixture Density Network ensembles. On benchmarks featuring multimodal systems, MI-LB matches or beats every baseline evaluated and is the only method to do so consistently -- geometric and Fisher-based baselines compete only when the input space already encodes the multimodality, and collapse otherwise.
☆ TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes
Imbalanced learning remains a fundamental challenge in tabular data applications. Despite decades of research and numerous proposed algorithms, a systematic empirical understanding of how different imbalanced learning methods behave across diverse data characteristics is still lacking. In particular, it remains unclear how different method families compare in predictive performance, robustness under varying data characteristics, and computational scalability. In this work, we present Tabular Imbalanced Learning Benchmark (TILBench), a large-scale empirical benchmark for tabular imbalanced learning. TILBench evaluates more than 40 representative algorithms across 57 diverse tabular datasets, resulting in over 200000 controlled experiments across a wide range of data characteristics. Our findings show that no single method consistently dominates across all settings; instead, the effectiveness of imbalanced learning methods depends strongly on dataset characteristics and computational constraints. Based on these findings, we provide practical recommendations for selecting appropriate methods in real-world applications.
☆ From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
☆ Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models AAMAS 2026
Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition to this, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest neighbor lookups to assign a linear function to each point in the state space resulting in a cell-like diagram. We validate our approach on several well known benchmarks and proof that this distillation approaches the original policy using a reasonable sized set of linear functions.
comment: Accepted for presentation at EXTRAAMAS 2026
☆ Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report
This paper presents a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3\%. Our approach focused on adapting existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset. This strategy was chosen because of the limited challenge duration and the available resources at the time. In addition, we designed a lightweight and resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble approach. Our system combines advanced neural architectures, extensive data augmentation, and optimised hyperparameters. These components helped achieve strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.
☆ Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
☆ PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection
Speech-based analysis offers a scalable and non-invasive approach for detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 mild cognitive impairment, and 50 dementia diagnoses collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session, including picture description and verbal fluency tasks, accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical validation evaluated demographic balance, clinical consistency, recording stability, embedding-space structure, and reproducible baseline modelling performance, demonstrating clinically meaningful group separation and stable performance across modelling approaches while preserving real-world conversational variability. PROCESS-2 is released under controlled access via Hugging Face to enable responsible reuse while protecting participant privacy, providing a reproducible benchmark resource for speech-based cognitive assessment research.
☆ AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks
Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations, and operate in the setting where multiple methods generate a suite of explanations for a single model. This makes comparison of explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline, extracting explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluating these explanations with AIM, identifying their limitations and obtaining insights to their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.
comment: 19 pages,4 figures, 8 tables
☆ BCI-Based Assessment of Ocular Response Time Using Dynamic Time Warping Leveraging an RDWT-Driven Deep Neural Framework IEEE
Mild traumatic brain injury (mTBI) is a prevalent condition that remains difficult to diagnose in its early stages. Oculomotor dysfunction is a well-established marker of mTBI, motivating the development of portable tools that capture both eye-movement behavior and underlying neurophysiology. In this work, we present an initial framework that integrates electroencephalogram (EEG) with augmented-reality (AR)-based Vestibular/Ocular Motor Screening (VOMS) tasks to estimate subject-specific ocular response times. Pre-processed EEG signals, obtained through band-pass filtering and average referencing, are analyzed using a Redundant Discrete Wavelet Transform (RDWT)-driven deep neural framework. The RDWT coefficients are subjected to trainable zero-phase convolutional filtering and reconstructed into the time domain via inverse RDWT, followed by channel-wise temporal and spatial filtering using 2D convolution layers and convolutional-LSTM-based decoding. An ablation study demonstrates that wavelet-domain filtering serves as an effective denoising strategy, improving prediction performance. Sliding-window predictions were validated using Pearson correlation (>= 0.5), and Dynamic Time Warping (DTW) was subsequently used to estimate ocular response times. DTW-derived metrics revealed significant inter-subject differences across all VOM tasks, supported by Mann-Whitney U tests. Cross-correlation analysis further revealed task-dependent temporal behaviors: pursuit tasks exhibited reactive tracking, whereas saccades showed anticipatory responses. Overall, the results highlight pursuit tasks as particularly informative for distinguishing timing differences and demonstrate the potential of RDWT-based EEG features combined with DTW metrics for multimodal mTBI assessment.
comment: Submitted to IEEE SMC 2026 (under review)
☆ Denoising-GS: Gaussian Splatting with Spatial-aware Denoising
Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.
☆ Temporal Fair Division in Multi-Agent Systems: From Precise Alternation Metrics to Scalable Coordination Proxies
A plethora real-world environments require agents to compete repeatedly for the same limited resource, calling for a temporal notion of fairness judged across entire interaction histories. This paper advances the theory of temporal fair division by introducing Rotational Periodicity (RP), a family of lightweight metrics, alongside the ALT family of sliding-window measures, within a unified framework for repeated multi-agent resource competition. We formalise the Multi-Agent Battle of the Exes (MBoE) as a repeated fair division instance and establish Perfect Alternation (PA) as its canonical temporally fair solution, drawing connections to proportionality, envy-freeness, and n-periodic round-robin allocation. RP decomposes temporal fairness into two complementary sub-measures: Rotational Score (RS) and Waiting Periods Evaluation (WPE), achieving O(nu+n) time complexity versus the O(nu*n) of ALT, where nu is the episode count and n the agent count. Empirical evaluation across n in {2,3,5,8,10} reveals three findings. First, both RP and ALT expose a coordination failure invisible to traditional metrics: Q-learning agents perform worse than random policies by 10-73% on RP and 7-35% on CALT, while Reward Fairness remains misleadingly high (above 0.92 for n>=3). Second, RP achieves 12-25x computational speedup over ALT, growing with n. Third, the two families are complementary: ALT provides richer discrimination for small populations; RP scales reliably where ALT becomes intractable. Together they form a diagnostic toolkit for temporal fair division.
comment: 15 pages, 3 figures, 8 tables. Submitted to ACM Transactions on Economics and Computation, Special Issue on Fair Division
☆ Fast Adversarial Attacks with Gradient Prediction
Generating adversarial examples at scale is a core primitive for robustness evaluation, adversarial training, and red-teaming, yet even "fast" attacks such as FGSM remain throughput-limited by the cost of a backward pass. We introduce a family of attacks that eliminates the backward pass by predicting the input gradient from forward-pass hidden states via a lightweight linear regression. The approach is motivated by a kernel view of neural networks and is exact in the Neural Tangent Kernel regime, while remaining effective for practical finite-width models. Empirically, our methods recover much of FGSM's attack performance while using only a small fraction of the time, corresponding to a $532\%$ increase in throughput. These results suggest gradient prediction as a simple and general route to significantly faster adversarial generation under realistic wall-clock constraints.
comment: 17 pages
☆ REALM: Retrospective Encoder Alignment for LFP Modeling
Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.
☆ A Non-Monotone Preconditioned Trust-Region Method for Neural Network Training
Training deep neural networks at scale can benefit from domain decomposition, where the network is split into subdomains trained in parallel and coupled by a global trust-region mechanism. Building on the Additively Preconditioned Trust-Region Strategy (APTS), we propose a non-monotone variant with a nonlinear additive Schwarz preconditioner that combines parallel subdomain corrections with global coarse-space directions. A windowed acceptance criterion allows controlled objective increases, avoiding needless rejection of effective coarse steps. The resulting non-monotone APTS (NAPTS) preserves accuracy while reducing CPU time by 30\% and cutting rejected steps to one third of those in APTS.
comment: 7 pages, 2 figures,
☆ Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers
Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.
comment: 12 pages
☆ XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference
We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
comment: 17 pages, 3 figures, 17 tables, 1 algorithm. Code: https://github.com/flash7777/vllm/tree/multiquant
☆ GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning
Low-rank adaptation (LoRA) has become the dominant paradigm for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). However, its bilinear structure introduces a critical limitation: the mapping from trainable parameters to weight updates is not distance-preserving, distorting the optimization landscape. Methods that project a low-dimensional vector into LoRA's parameter space, such as Uni-LoRA, improve parameter efficiency, but the subsequent bilinear LoRA map breaks end-to-end isometry, leaving the core distance-preservation problem unresolved. We propose GPart (Global Partition fine-tuning), a highly parameter-efficient fine-tuning method which removes the low-rank bottleneck entirely. Our method uses a single isometric partition matrix to map a $d$-dimensional trainable vector directly into the full weight space of the model. The result is an extremely minimal fine-tuning pipeline: one random projection, end-to-end isometric, with a single clean hyperparameter ($d$) and storage cost of $d+1$ values (the trainable vector plus a random seed). GPart builds on the theoretical premise that effective fine-tuning can emerge from random low-dimensional subspaces of the full weight space, without imposing low-rank matrix structure. We empirically demonstrate the superior or comparable performance of GPart to existing PEFT methods on natural language understanding, computer vision tasks, and mathematical reasoning. Overall, GPart achieves state-of-the-art efficiency and performance by removing structural constraints, offering a straightforward and elegant path to PEFT.
In-Context Learning for Data-Driven Censored Inventory Control
We study inventory control with decision-dependent censoring, focusing on the censored or repeated newsvendor (R-NV), where each order quantity determines whether demand is fully observed or censored by sales. Existing approaches based on parametric Thompson sampling (TS) can be brittle under prior mismatch, while offline imputation methods need not transfer to online learning. Motivated by the predictive view of decision making, we combine these ideas by taking oracle actions on learned completions of latent demand. We propose in-context generative posterior sampling (ICGPS), which uses modern generative models that are meta-trained offline and deployed online by in-context autoregressive generation. Theoretically, we show that the Bayesian regret of ICGPS with a learned completion kernel is bounded by the Bayesian regret of a TS benchmark with the ideal completion kernel plus a deployment penalty scaling as $\sqrt{T}$ times the square root of the completion mismatch. This yields a plug-in template for operational problems with known TS regret bounds. For R-NV, we derive sublinear Bayesian regret by reducing censored feedback to bandit convex optimization feedback. We also show that, under reasonable coverage and stability assumptions, the online completion mismatch is controlled by the offline censored predictive mismatch, so offline predictive quality transfers to online performance. Practically, we instantiate ICGPS with ChronosFlow, which combines a frozen time-series transformer backbone with a trainable conditional normalizing-flow head for fast censoring-consistent sampling. In benchmark experiments, ChronosFlow-ICGPS matches correctly specified TS, outperforms myopic and UCB-style baselines, and is robust to prior mismatch and distribution shift. ChronosFlow-ICGPS also performs well for the real-world SuperStore dataset, especially under heavy censoring.
☆ GenAI for Energy-Efficient and Interference-Aware Compressed Sensing of GNSS Signals on a Google Edge TPU
Traditional methods for classifying global navigation satellite system (GNSS) jamming signals typically involve post-processing raw or spectral data streams, requiring complex and costly data transmission to cloud-based interference classification systems. In contrast, our proposed approach efficiently compresses GNSS data streams directly at the hardware receiver while simultaneously classifying jamming and spoofing attacks in real time. Given the growing prevalence of GNSS jamming, there is a critical need for real-time solutions suitable for power-constrained environments. This paper introduces a novel method for compressing and classifying GNSS jamming threats using generative artificial intelligence (GenAI), specifically variational autoencoders (VAEs), deployed on Google Edge tensor processing units (TPUs). The study evaluates various autoencoder (AE) architectures to compress and reconstruct GNSS signals, focusing on preserving interference characteristics while minimizing data size near the receiver hardware. The pipeline adapts large-scale AE models for Google Edge TPUs through 8-bit quantization to ensure energy-efficient deployment. Tests on raw in-phase and quadrature-phase (IQ) data, Fast Fourier Transform (FFT) data, and handcrafted features show the system achieves significant compression (>42x) and accurate classification of approximately 72 interference types on reconstructed signals (F2-score 0.915), closely matching the original signals (F2-score 0.923). The hardware-centric GenAI approach also substantially reduces jammer signal transmission costs, offering a practical solution for interference mitigation. Ablation studies on conditional and factorized VAEs (i.e., FactorVAE) explore latent feature disentanglement for data generation, enhancing model interpretability and fostering trust in machine learning (ML) solutions for sensitive interference applications.
comment: 12 pages
☆ Interestingness as an Inductive Heuristic for Future Compression Progress
One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness -- the capacity for past progress to signal future discovery -- is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.
☆ K-Models: a Flexible and Interpretable Method for Ordinal Clustering with Application to Antigen-Antibody Interaction Profiles
Existing clustering methods for functional data often prioritize partitioning accuracy over interpretability, making it challenging to extract meaningful insights when the data-generating process follows a specific underlying structure and an ordinal relationship among clusters is suspected. This work introduces K-Models, a novel framework that integrates ordinal constraints and estimates key underlying elements of the random process generating the observed functional profiles, improving both interpretability and structure identification. The proposed method is evaluated through simulations and real-world applications. In particular, it is tested on Region of Interest (ROI) curves, which represent reaction profiles from a reflectometric sensor monitoring biomolecular interactions, such as antigen-antibody binding. These curves represent changes in reflected light intensity over time at multiple measurement spots with immobilized antigens during analyte exposure, capturing the binding dynamics of the system. The goal is to identify intrinsic signal patterns solely from the observed dynamics, making this dataset an ideal benchmark for assessing the added interpretability of the proposed approach. By incorporating structural assumptions into the clustering process, K-Models enhances interpretability while maintaining performance comparable to state-of-the-art techniques, providing a valuable tool for analyzing functional data with an underlying ordinal structure.
☆ ToMAToMP: Robust and Multi-Parameter Topological Clustering
Topological clustering, and its main algorithm ToMATo, is a clustering method from Topological Data Analysis (TDA) which has been applied successfully in several applications during the last few years. This is due to its high versatility, as clusters are detected from the persistent components in the sublevel sets of any user-defined function (gene expression, pixel values, etc), and efficiency, as topological clustering enjoys robustness guarantees. However, ToMATo is also limited in several ways. First, a graph on the data points needs to be provided as a hyper-parameter of the method (whose fine-tuning is left to the user). Second, ToMATo is known to be very sensitive to outlier values in the function range. Finally, and most importantly, ToMATo can only handle one function at a time, whereas it is critical to use several functions in various applications. In this article, we introduce ToMAToMP: the first topological clustering method able to handle several functions at the same time with theoretical guarantees. More specifically, we leverage a recent tool from multi-parameter persistent homology, called MMA decomposition, to design our clustering algorithm, and prove that it enjoys robustness properties. As corollaries, we show that it can be used to make ToMATo independent of graph tuning, and robust to outliers. Finally, we provide a set of numerical experiments showcasing the efficiency and quality of the clusterings produced by ToMAToMP, by showing strong improvement over non-topological and topological baselines for various datasets.
☆ GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning
Graph prompt tuning has shown great potential in graph learning by introducing trainable prompts to enhance the model performance in conventional single-domain scenarios. Recent research has extended graph prompts to improve Graph Foundation Models (GFMs) by few-shot tuning auxiliary prompts. Despite their progress, most existing methods embed source-domain information into prompts, which serve either as input to GFMs or encoded during model pre-training. Such prompt entanglement with specific source domains and GFM pre-training strategy restricts their generalisability to other domains and different GFMs. Furthermore, existing GFM prompts merely rely on few-shot tuning for adaptation, neglecting the rich information in unlabelled target domain test data. Motivated by these insights, this paper aims to empower GFMs with pre-training-agnostic test-time graph prompt tuning, named GFMate. GFMate introduces centroid and layer prompts applied after pre-training on target domains, avoiding entanglement with specific source domains and model pre-training. In addition, a test-time complementary learning objective is devised to exploit both labelled and unlabelled target domain data for effective test-time prompt tuning. Extensive experiments on 12 benchmark datasets demonstrate the superior performance and efficiency of GFMate, achieving improvements of up to 30.63%. Code is available at https://github.com/YanJiangJerry/GFMate.
☆ Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96\% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \href{https://github.com/KabakaWilliam/known_actions}{here}.
☆ Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning
Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal$\unicode{x2013}$replaying a subset of past samples$\unicode{x2013}$is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient$\unicode{x2013}$capturing self-induced interference$\unicode{x2013}$emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.
comment: 37 pages; 24 tables; 7 figures; submitted to a journal
☆ Peng's Q($λ$) for Conservative Value Estimation in Offline Reinforcement Learning ICLR 2026
We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($λ$) (CPQL). Our algorithm adapts the Peng's Q($λ$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a \textit{multi-step} operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees -- a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at https://github.com/oh-lab/CPQL.
comment: Accepted in ICLR 2026
☆ Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training
Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
☆ BioHuman: Learning Biomechanical Human Representations from Video
Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.
☆ Composable Crystals: Controllable Materials Discovery via Concept Learning
De novo crystal generation, a central task in materials discovery, aims to generate crystals that are simultaneously valid, stable, unique, and novel. Existing methods mainly rely on black-box stochastic sampling, providing limited control over how generated structures move beyond the observed distribution. In this paper, we introduce a concept-based compositional framework for crystal generation. We train a vector-quantized variational autoencoder to automatically discover a shared set of reusable crystal concepts, which serve as building blocks for guided generation. These learned concepts naturally exhibit interpretability from both local atomic environments and global symmetry patterns, and generalize to crystals from different distributions. By recombining such concepts, our framework enables controllable exploration of novel crystals beyond the training distribution, rather than relying solely on unconstrained random sampling. To further improve composition efficiency, we introduce a composition generator and iteratively refine it using high-quality samples generated by the model itself. The resulting concept compositions are then used to condition downstream crystal generation. Numerical experiments on MP-20 and Alex-MP-20 show that compositing concepts separately increase base model up to 53.2% and 51.7% on V.S.U.N metric, with particular gains in novelty.
☆ Compositional Sparsity as an Inductive Bias for Neural Architecture Design
Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.
Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement
De novo crystal generation seeks to discover materials that are not merely realistic, but also stable and novel. However, most existing generative models are trained to maximize the likelihood of observed crystals, which encourages samples to stay close to known materials yet not necessarily align with the criteria that matter in discovery. Through an empirical investigation, we show that current crystal generative models are caught in a pronounced stability--novelty trade-off: moving toward the observed distribution preserves stability but limits novelty, whereas moving away from it quickly destroys stability. This suggests that the useful region for discovering crystals that are both stable and novel is extremely narrow. To escape the trade-off, we introduce Crys-JEPA, a joint embedding predictive architecture for crystals that learns an energy-aware latent space preserving formation-energy differences. In this space, stability assessment can be reformulated as an embedding-based comparison against accessible training crystals, reducing the reliance on expensive energy evaluation and task-specific external references. Building on Crys-JEPA, we further develop a screening-and-refinement pipeline that identifies promising generated crystals and reintroduces them to refine the generative model. On MP-20 and Alex-MP-20 datasets, we achieve improvements over baselines up to 81.4% and 82.6% on V.S.U.N metric, respectively.
☆ Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions ACL 2026
Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) deployment parado-large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming sota LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at https://github.com/RoschildRui/acl2026_map.
comment: ACL 2026 Findings. 10 pages, 5 figures, 19 tables
☆ Non-linear Interventions on Large Language Models
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
☆ Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining ICML 2026
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.
comment: Accepted at ICML 2026
☆ Selective Safety Steering via Value-Filtered Decoding
While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model's sampling policy at decoding time using a safety reward. However, existing decoding-time steering methods often intervene unnecessarily, modifying generations that would have been safe under the base model. Such unnecessary interventions are undesirable, as they can distort key properties of the base model such as helpfulness, fluency, style, and coherence. We propose a new test-time steering method designed to reduce such unnecessary interventions while improving the safety of unsafe responses. Our approach filters tokens using a value-based safety criterion and provides an explicit bound on the probability of false interventions. A single threshold hyperparameter controls this bound, allowing practitioners to trade off higher rates of unnecessary intervention for better output safety. Across multiple datasets and experiments, we show that our value-filtered decoding method outperforms existing baselines, achieving better trade-offs between safety, helpfulness, and similarity to the base model.
☆ TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability
Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
☆ IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
comment: 8 pages
☆ Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
☆ AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Contro
Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.
☆ NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces
Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.
☆ The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic representations (highly interpretable), limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability of features to co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.
☆ ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators
Tabular data sharing under privacy constraints is increasingly important for research and collaboration. Synthetic data generators (SDGs) are a promising solution, but synthetic data remains vulnerable to attacks, such as membership inference attacks (MIAs), which aim to determine whether a specific record was part of the training data. State-of-the-art MIAs are powerful but impractical: they rely on shadow modeling, requiring hundreds of SDG training runs, and need auxiliary data several times larger than the original training set. Fast proxy metrics like distance to closest record (DCR) are efficient but have limited sensitivity to MIA risk. We introduce ReMIA (Relative Membership Inference Attack), a practical privacy metric that requires only two SDG training runs and additional data no larger than the original training set. Rather than predicting whether a record was in the training set, ReMIA generates two synthetic datasets from two source datasets and measures whether a classifier can identify which source a record came from. Experiments across multiple tabular datasets and SDGs show that ReMIA has a sensitivity comparable to state-of-the-art MIAs while being substantially more practical. We further observe that SDGs can achieve privacy-utility trade-offs that traditional noise-based anonymization methods do not match. Code is available at https://github.com/aindo-com/remia.
☆ Spontaneous symmetry breaking and Goldstone modes for deep information propagation
In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.
comment: 28 pages. Code at https://github.com/nabiliqbal/ssb-goldstone-deep-info-prop
☆ AQKA: Active Quantum Kernel Acquisition Under a Shot Budget
Estimating an $N \times N$ quantum kernel from circuit fidelities requires $Θ(N^2 S)$ measurement shots, the dominant bottleneck for deployment on near-term hardware. Existing budget-saving methods (Nyström-QKE, ShoFaR, kernel-target alignment) sub-sample \emph{which} entries to measure but allocate shots \emph{uniformly} within their chosen subset, ignoring how much each entry drives the downstream classifier. We close this gap with two contributions. \textbf{First, a complete regime decomposition} for shot-budgeted quantum kernel learning: a principled menu of when each allocator wins. Our method, \emph{AQKA}, dominates the budget-limited regime ($B \lesssim 16 n_{\mathrm{pairs}}$) on sparse-sensitivity KRR, with the gap \emph{growing} from $+8$ to $+25$ pts over uniform as $N$ scales $225{\to}1000$ and reaching $+26$--$32$ pts on an \texttt{ibm\_pittsburgh} (156-qubit Heron) hardware kernel; Nyström-QKE wins at saturating budgets on planted-sparse via low-rank reconstruction; ShoFaR is competitive only at extreme low budgets. \textbf{Second, a closed-form pair-level acquisition theory}: $s_{ij}^{\star} \propto |g_{ij}|\sqrt{K_{ij}(1-K_{ij})}$ with explicit gradient $g_{ij}$ for KRR (Lemma~1, $|β_iα_j+β_jα_i|\sqrt{K_{ij}(1-K_{ij})}$) and SVM via the envelope theorem ($|η_i^*η_j^*|\sqrt{K_{ij}(1-K_{ij})}$); a \emph{corrected} sparsity-aware Cauchy--Schwarz rate $ρ\le 2m/N$ matching empirics (vs.\ the naive $m^2/N^2$); an explicit-constant plug-in regret bound (Theorem~2); and a tighter SVM ceiling $ρ^{\mathrm{SVM}} \le m_{\mathrm{sv}}^2/N^2$. We close with the first multi-seed live online adaptive shot allocation on quantum hardware: $+17.0 \pm 4.8$ pts at $N{=}20$ on \texttt{ibm\_aachen} ($3.5σ$, 5 seeds), with the advantage holding at $N{=}30$ at higher budget on \texttt{ibm\_berlin} ($+14.0 \pm 8.5$ pts, 5 seeds).
☆ Scalable Solution of the Stochastic Multi-path Traveling Salesman Problem via Neural Networks
The multi-path Traveling Salesman Problem with stochastic travel costs arises in hybrid vehicle routing applications designed for Smart City and City Logistics, where multiple paths exist between each pair of locations. Travel times along these paths are typically affected by real-time traffic conditions and therefore modeled as stochastic. The objective of the problem is to determine a Hamiltonian tour that minimizes the expected total travel cost under uncertainty. In this work, we adopt a two-stage stochastic programming formulation. In the first stage, a predefined route specifying the sequence of locations to be visited is determined, while taking into consideration a second-stage recourse problem that selects the optimal path from the feasible set of alternative paths for each pair of locations, once real-time traffic conditions are realized. To reduce the computational burden imposed by the large number of scenarios required to capture travel time uncertainty, the innovation of this work is the integration of neural network-based surrogate models to approximate the expected value of the second-stage recourse problem. Different architectures and training strategies for the neural networks are proposed and analyzed, with performance evaluated in terms of computation time, solution quality, and generalization capability. Preliminary findings demonstrate the enhanced scalability and practical applicability of the approach for complex vehicle routing problems under uncertainty.
☆ Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning
Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman--Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results separate the critical data size for the onset of generalization from the dataset size that optimizes update-based convergence, and identify structured-output tasks where learning the rule and completing exact-fitting can diverge.
☆ Unbiased and Second-Order-Free Training for High-Dimensional PDEs ICML 2026
Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second-order-free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler-Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high-order schemes such as Heun can fully eliminate this bias, such schemes re-introduce second-order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM-induced loss bias and propose an unbiased, second-order-free training framework that preserves the computational advantages of BSDE methods. Our code is available at https://github.com/seojaemin22/Un-EM-BSDE.
comment: Accepted at ICML 2026
☆ DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes
Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.
☆ Action-Inspired Generative Models
We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential $V_φ$ that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier -- preventing adversarial feedback between the two networks whilst preserving $V_φ$'s guiding signal. Crucially, $V_φ$ comprises only $\sim$1.4% of the primary drift network's parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, $V_φ$ is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.
comment: 11 pages, 5 figures, and 4 tables
☆ An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization
A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per-instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the Multi-Task VRP (MTVRP) environment at n=20 customers across 19 problem variants and five training seeds, with HGS via PyVRP as the heuristic baseline. The measured crossover sits near $1.58 \times 10^5$ deployed instances; the per-instance ratio is 0.41, reflecting the moderate size of the instances tested. The contribution is the framework, the open instrumentation, and the measurement protocol; structural convergence of the ratio at larger problem sizes is left to future empirical work.
comment: 16 pages, 5 figures, 3 tables. v0.1: framework + measurement protocol instantiated at n=20; empirical extension to larger problem sizes deferred to v0.2
☆ Deep Image Segmentation via Discriminant Feature Learning ICIP 2026
Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
comment: Accepted to ICIP 2026
☆ One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only against fixed attacks that do not account for the defense. We show that these robustness claims are incomplete. Surveying 15 recent defenses, we identify several defense mechanisms and show that they share a single weakness: they obscure or misdirect the path to harmful behavior without removing the behavior itself. We then develop a unified adaptive attack that breaks defenses across all defense mechanisms. Our results show that current approaches do not provide robust security; they mainly stop the attacks they were designed against. We hope that our unified adaptive adversary for this domain will help future researchers and practitioners stress-test new defenses before deployment.
comment: Under review
☆ Fast Rates for Inverse Reinforcement Learning
We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate $\mathcal{O}(n^{-1})$, where $n$ is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.
☆ Silent Collapse in Recursive Learning Systems
Recursive learning -- where models are trained on data generated by previous versions of themselves -- is increasingly common in large language models, autonomous agents, and self-supervised systems. However, standard performance metrics (loss, perplexity, accuracy) often fail to detect internal degradation before it becomes irreversible. Here we identify a phenomenon we call silent collapse: under broad recursive conditions, model internal distributions -- predictive entropy, representational diversity, and tail coverage -- progressively contract even as conventional metrics appear stable or improving. We discover that silent collapse is not abrupt. Its onset is reliably preceded by three trajectory-level precursors: (1) contraction of anchor entropy, (2) freezing of representation drift, and (3) erosion of tail coverage. These signals manifest multiple generations before any degradation in standard validation metrics, enabling early warning. Based on these precursors, we propose the MTR (Monitor--Trust--Regulator) framework, a lightweight metacognitive loop that monitors trajectory statistics, estimates a slow-timescale trust variable, and adaptively modulates the effective learning intensity. MTR provides early warning and actively prevents silent collapse without requiring access to pristine real data -- a critical advantage when original data is unavailable, contaminated, or private.
☆ Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning ICML 2026
Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.
comment: To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea
☆ All-atomistic Transferable Neural Potentials for Protein Solvation
Implicit solvent models are widely used to decrease the number of solvent degrees of freedom and enable the calculation of solvation energetics without water molecules. However, its accuracy often falls short compared to explicit models. Recent advancements in neural potentials have shown promise in drug discovery, but transferability remains a persistent challenge. Here, we introduce the Protein Hydration Neural Network (PHNN), an implicit solvent model that extends analytical continuum solvation by learning transferable corrections to model parameters instead of applying post hoc adjustments to final energies. The model is explicitly designed to maximize data efficiency by leveraging physical priors embedded in the data. We demonstrate that PHNN improves accuracy relative to traditional analytical methods and maintains predictive accuracy on out-of-domain protein systems.
☆ Woodelf++: A Fast and Unified Partial Dependence Plot Algorithm for Decision Tree Ensembles IJCAI 2026
Partial Dependence Plots (PDPs) visualize how changes in a single feature affect the average model prediction. They are widely used in practice to interpret decision tree ensembles and other machine learning models. Joint-PDPs extend this idea to pairs of features, revealing their combined effect. Partial Dependence Interaction Values (PDIVs) measure feature interactions. The Any-Order-PDIVs task computes these interactions for every feature subset across all rows of the dataset. We introduce Woodelf++, a unified and efficient approach for computing all these useful explainability tools on decision tree ensembles, building on Woodelf, an algorithm for efficient SHAP computation. By deriving suitable metrics over pseudo-Boolean functions, Woodelf++ can compute PDPs (exact and approximate), Joint-PDPs, and Any-Order-PDIVs in a unified framework. Our method delivers substantial complexity improvements over the state of the art, including an exponential gain for Any-Order-PDIVs. Additionally, we introduce and efficiently compute Full PDPs, which leverage the model's split thresholds to faithfully capture its behavior across all possible feature values. Woodelf++ is implemented in pure Python and supports GPU acceleration. On a dataset with 400,000 rows, Woodelf++ computes PDP and Joint-PDP up to 6x faster than the state of the art and up to five orders of magnitude faster than scikit-learn. For Any-Order-PDIVs, the gap is even larger: Woodelf++ computes all interaction values in 5 minutes, while the state of the art is estimated to require over 1,000,000 years.
comment: Extended version of the paper to appear at IJCAI 2026
☆ Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance
Observing touch on another's body can elicit corresponding tactile sensations in the observer, a phenomenon termed mirror touch that supports empathy and social perception. This visuo-tactile resonance is thought to rely on structural correspondence between visual and somatosensory cortices, yet robotic systems lack computational frameworks that instantiate this principle. Here we demonstrate that cortical correspondence can be operationalized to endow robots with mirror touch. We introduce Mirror Touch Net, which imposes semantic, distributional and geometric alignment between visual and tactile representations through multi-level constraints, enabling prediction of millimetre-scale tactile signals across 1,140 taxels on a robotic hand from RGB images. Manifold analysis reveals that these constraints reshape visual representations into geometry consistent with the tactile manifold, reducing the complexity of cross-modal mapping. Extending this alignment framework to cross-domain observations of human hands enables tactile prediction and reflexive responses to observed human touch. Our results link a neural principle of visuo-tactile resonance to robotic perception, providing an explainable route towards anticipatory touch and empathic human-robot interaction. Code is available at https://github.com/fun0515/Mirror-Touch-Net.
☆ Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
comment: 30 pages, 12 figures and tables, 58 references. Under review at Software Quality Journal (Springer). Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359
☆ Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.
☆ Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
comment: Preprint
☆ Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits ICLR 2026
Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection -- efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
comment: Published as a conference paper at ICLR 2026
☆ SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies IJCAI
Instance normalization (IN) is widely used in non-stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over-smooth instance-specific structural information that is essential for modeling temporal and cross-channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance-specific dependency based on each instance's non-stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance-specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance-level non-stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies. Extensive experiments on multiple real-world benchmarks demonstrate that SeesawNet consistently outperforms state-of-the-art methods.
comment: Accepted by IJCAI-ECAI 2026, the 35th International Joint Conference on Artificial Intelligence. Code is at https://github.com/dreamone-Lee/SeesawNet
☆ Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework
Artificial intelligence in high-stakes tabular domains cannot be evaluated by predictive performance alone, yet current practice still assesses explainability, fairness, robustness, privacy, and sustainability mostly in isolation. We propose the Model Integrity and Responsibility Assessment Index (MIRAI), a unified evaluation framework that measures tabular models across these five dimensions under a controlled comparison setting and aggregates them into a single score. MIRAI combines established metrics through normalized and direction-aligned dimension scores, which enables direct comparison across models with different architectural and computational profiles. Experiments on healthcare, financial, and socioeconomic datasets show that higher predictive performance does not necessarily imply better overall integrity and responsibility. In several cases, simpler models achieve a stronger cross-dimensional balance than more complex deep tabular architectures. MIRAI provides a compact and practical basis for responsible model selection in regulated settings.
comment: Accepted to the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026)
☆ Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction--diffusion system, viscosity-parameterized two-dimensional Navier--Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.
☆ RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
☆ Exploring Geographic Relative Space in Large Language Models through Activation Patching
The increased use of Large Language Models (LLMs) in geography raises substantial questions about the safety of integrating these tools across a wide range of processes and analyses, given our very limited understanding of their inner workings. In this extended abstract, we examine how LLMs process relative geographic space using activation patching, an emerging tool for mechanistic interpretability.
☆ Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows
Developing machine learning interatomic potentials (MLIPs) for complex materials systems remains challenging because it requires expertise in atomistic simulations, machine learning, and workflow design, as well as iterative active learning procedures. Existing automated pipelines typically assume a fixed sequence of stages or depend on domain experts, which limits their adaptability to heterogeneous materials systems where the optimal curriculum is not known in advance. To lower the barrier to developing MLIPs for non-experts, we propose Lang2MLIP, a multi-agent framework that takes natural-language input and formulates end-to-end MLIP development as a sequential decision-making problem solved by large language models (LLMs). At each step, a decision-making agent observes the current dataset, model, evaluation results, and execution log, and then automatically selects an appropriate action to improve the model. This removes the need for a predefined pipeline and enables the agent to self-correct by revisiting earlier subsystems when new failures arise. We evaluate this approach on a solid electrolyte interphase (SEI) system with multiple components and interfaces. These results suggest that LLM-based multi-agent systems are a promising direction for automating MLIP development and making it more accessible to non-experts.
comment: 31 pages, 12 figures
☆ Large Dimensional Kernel Ridge Regression: Extending to Product Kernels
Recent studies have reported $\textit{saturation effects}$ and $\textit{multiple descent behavior}$ in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on sphere, including: $i)$ the $\textit{minimax optimality}$ when the source condition $s\le 1$; $ii)$ the $\textit{saturation effect}$ when $s>1$; $iii)$ a $\textit{periodic plateau phenomenon}$ in the convergence rate and a $\textit {multiple-descent behavior}$ with respect to the sample size $n$.
☆ Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may discard benefits associated with centering. This paper propose a framework to determine whether an LN in an arbitrary DNN can be replaced by RMSNorm without changing the model function. The key idea is to fold LN's centering operation into upstream general linear layers by enforcing zero-mean outputs through the column-centered constraint (CCC) and column-based weight centering (CBWC). We extend the analysis to arbitrary DNNs, define such LNs as foldable LNs, and develop a graph-based detection algorithm. Our analysis shows that many LNs in widely used architectures are foldable, enabling exact inference-time conversion and end-to-end acceleration of 2% to 12% without changing model predictions. Experiments across multiple task families further show that, when exact equivalence is partially broken in practical training settings, our method remains competitive with vanilla LN while improving efficiency.
comment: 33 pages, 21 figures
☆ ArcGate: Adaptive Arctangent Gated Activation
Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general node activation function for high-resolution earth observation tasks.
☆ Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning
This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.
comment: 6 pages, 5 figures, 1 table, accepted at the 23rd IFAC World Congress, Busan, South Korea, Aug. 23-26, 2026. Open invited track 9-131: "Control and Optimization for Smart Cities"
☆ ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization IJCAI 2026
Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
comment: 20 pages, 9 figures, 7 tables. Accepted to IJCAI 2026
☆ Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty
Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong calability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.
☆ A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures
Building black-box models for dynamical systems from data is a challenging problem in machine learning, especially when asymptotic stability guarantees are required. In this paper, we introduce a novel stability-ensuring and backpropagation-compatible projection scheme based on the Schur decomposition for the state matrix of linear discrete-time state-space layers, as well as an alternative pre-factorized formulation of the methodology. The proposed methods dynamically project the quasi-triangular factor of the state matrix's real Schur decomposition onto its nearest stable peer, ensuring stable dynamics with minimal overparameterization. Experiments on synthetic linear systems demonstrate that the method achieves accuracy and convergence rates comparable to those of state-of-the-art stable-system identification techniques, despite a marginal increase in computational complexity. Furthermore, the lower weight count facilitates convergence during training without sacrificing accuracy in stacked neural-network architectures with static nonlinearities targeting real-world datasets. These results suggest that the Schur-based projection provides a numerically robust framework for identifying complex dynamics on par with the State of the Art while satisfying strict asymptotic-stability requirements.
comment: 32 pages, 13 figures. Source code at https://codeberg.org/sergiovaneg/SchurSS
☆ Test-Time Learning with an Evolving Library
We introduce EvoLib, a test-time learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. Instead of adapting model parameters, our approach maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model's own inference trajectories. To support continual improvement, we introduce a principled weighting and consolidation mechanism that jointly optimizes for immediate utility and long-term value. This allows simple, instance-specific abstractions to evolve into more general and reusable ones over time. Across challenging benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments, EvoLib improves substantially over the top test-time scaling and learning methods without ground-truth feedback.
☆ Focused PU learning from imbalanced data
We propose a new method of learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods, due to limited labeled data. Often, training data comprises positive and unlabeled instances, the latter typically being dominated by negative, but including also several positive instances. While PU learning is well-studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator, incorporating both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms - selecting positives completely at random (SCAR) and selecting at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
☆ Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact
The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.
☆ LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
comment: 27 pages, 3 figures
☆ When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Hallucination detection in large language models (LLMs) requires balancing accu racy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are effi cient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the or thogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
☆ FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems'goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.
comment: 10 pages and reference, appendix
☆ Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $γ\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated \texttt{Hopper-v5} action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
☆ What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
Time series forecasting has become increasingly critical in real-world scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to the real-world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition-aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text-attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions. The project page is at https://seqml.github.io/TADiff/
☆ MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse
Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
comment: 29 pages, 8 figures
☆ GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation
Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.
☆ Watch your neighbors: Training statistically accurate chaotic systems with local phase space information
Chaotic systems pose fundamental challenges for data-driven dynamics discovery, as small modeling errors lead to exponentially growing trajectory discrepancies. Since exact long-term prediction is unattainable, it is natural to ask what a good surrogate model for chaotic dynamics is. Prior work has largely focused either on reproducing the Jacobian of the underlying dynamics, which governs local expansion and contraction rates, or on training surrogate models that reproduce the ground-truth dynamics' long-term statistical behavior. In this work, we propose a new framework that aims to bridge these two paradigms by training surrogate dynamics models with accurate Jacobians and long-term statistical properties. Our method constructs a local covering of a chaotic attractor in phase space and analyzes the expansion and contraction of these coverings under the dynamics. The surrogate model is trained by minimizing the maximum mean discrepancy between the pushforward distributions of the coverings under the surrogate and ground-truth dynamics. Experiments show that our method significantly improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods. Our code is fully available at https://anonymous.4open.science/r/neighborwatch.
☆ Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion
Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g. shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth with the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), while AdvPatch only 0--9%. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.
☆ Nexus : An Agentic Framework for Time Series Forecasting
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware to real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.
comment: 30 Pages, 3 figures, 5 Tables
☆ NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).
☆ Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.
comment: 17 pages, 4 figures. JB Lanier and Nathan Monette contributed equally
☆ Optimal Pattern Detection Tree for Symbolic Rule-Based Classification
Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.
comment: Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: https://openreview.net/forum?id=RJ6eMDcDCv
☆ Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization ICML 2026
Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
☆ Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax ACL 2026
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
comment: ACL 2026 Findings
☆ LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning
Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_kB_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $σ_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, σ_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $σ_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, σ_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
☆ MoRe: Modular Representations for Principled Continual Representation Learning on Squantial Data
Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting representations, while structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure, improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation
☆ RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression ICML 2026
Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
comment: To appear at ICML 2026
☆ Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model's answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
☆ Randomized Atomic Feature Models for Physics-Informed Identification of Dynamic Systems
We present a physics-informed framework for system identification based on randomized stable atomic features. Impulse responses are represented as random superpositions of stable atoms, namely damped complex exponentials associated with poles sampled inside a prescribed disk. Identification is then cast as a convex regularized least-squares problem with optional linear, second-order-cone, and KYP constraints. The approach generalizes random Fourier and random Laplace features to the damped, nonstationary regime relevant to engineering systems while retaining modal interpretability and scalable finite-dimensional computation. The main analytic point is an operator-theoretic Disk-Bochner viewpoint: positive measures over stable poles generate positive-definite kernels with a radius-dependent shift defect, while a converse scalar disk moment representation for an arbitrary kernel is characterized by subnormality of the canonical shift. We prove this statement, establish an RKHS-to-l1 embedding, show that sampled poles induce a valid finite atomic gauge, discuss random-feature convergence, and state sparse-recovery guarantees conditionally on the restricted-eigenvalue properties of the realized disk-Vandermonde or input-output design matrix. We also connect the normalized transfer function problem to Nevanlinna-Pick interpolation and LFT set-membership. The framework directly encodes stability margins, modal localization, DC-gain bounds, monotonicity, passivity, relative degree, settling-time targets, and time/frequency-domain error bounds. Numerical comparisons illustrate how physically meaningful priors can compensate for poor excitation and improve constrained impulse-response recovery in an under-informative data setting.
comment: Extended version of the conference paper submitted for IFAC World Congress, 2026
☆ Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
Multi-task reinforcement learning (MTRL) aims to train a single agent to efficiently optimize performance across multiple tasks simultaneously. However, jointly optimizing all tasks often yields imbalanced learning: agents quickly solve easy tasks but learn slowly on harder ones. While prior work primarily attributes this imbalance to conflicting task gradients and proposes gradient manipulation or specialized architectures to address it, we instead focus on a distinct and under-explored challenge: imbalanced data allocation. Standard MTRL allocates an equal number of environment interactions to each task, which over-allocates data to easy tasks that require relatively few interactions to solve and under-allocates data to hard tasks that require substantially more experience to solve. To address this challenge, we introduce Distributionally Robust Adaptive Task Sampling (DRATS), an algorithm that adaptively prioritizes sampling tasks furthest from being solved. We derive DRATS by formalizing MTRL as a feasibility problem from which we derive a minimax objective for minimizing the worst-case return gap, the difference between a desired target return and the agent's return on a task. In benchmarks like MetaWorld-MT10 and MT50, DRATS improves data efficiency and increases worst-task performance compared to existing task sampling algorithms.
☆ Exemplar Partitioning for Mechanistic Interpretability
We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim 20\%$ of EP regions match an SAE feature at $F_{1} > 0.5$, and EP one-hot probes retain $\sim 97\%$ of raw-activation probe accuracy at $\ell_{0} = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_{1}$ reaches mean AUROC $0.881$, $+0.126$ over the canonical GemmaScope SAE leaderboard entry and within $0.030$ of SAE-A's $0.911$, at $\sim 10^{3}\times$ less build compute.
comment: Code: https://github.com/jessicarumbelow/exemplar-partitioning. Pretrained dictionaries: https://huggingface.co/datasets/J-RUM/exemplar-partitioning
☆ Nearest-Neighbor Radii under Dependent Sampling
Nearest-neighbor methods are fundamental to classical and modern machine learning, yet their geometric properties are typically analyzed under independent sampling. In this paper, we study the nearest-neighbor radii under dependent sampling. We consider strong mixing dependent observations and ask whether dependence changes the scale of nearest-neighbor neighborhoods. We establish distribution-free almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The moment bounds depend on the local intrinsic dimension rather than the ambient dimension, making the results applicable to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks support the theory, showing that nearest-neighbor geometry remains informative under dependence sampling.
comment: 33 pages
☆ Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems
Modern edge devices increasingly rely on neural networks for intelligent applications. However, conventional digital computing-based edge inference requires substantial memory and energy consumption. In analog radio frequency (RF) computing, a base station (BS) encodes the weights of the neural networks and broadcasts the RF waveforms to the clients. Each client reuses its passive mixer to multiply the received weight-encoded waveform with a locally generated input-encoded waveform. This enables wireless receivers to perform the matrix-vector multiplications (MVMs) that account for most of the computation burden in edge inference with ultra-low energy consumption. Unlike conventional downlink transmissions which are optimized for communications, analog RF computing requires a computing-centric physical layer that controls both the analog MVM accuracy and the energy consumption for inference. Motivated by this, in this paper, we propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems. We derive tractable models for computing accuracy and energy consumption for inference, formulate a joint BS beamforming and client-side scaling problem subject to computing accuracy, transmit power, and hardware constraints, and develop a low-complexity algorithm to solve the non-convex problem. The proposed design provides client- and layer-specific accuracy control for both uniform- and mixed-precision inference. Simulations under 3GPP specifications show that analog RF computing can significantly reduce client-side energy consumption by nearly two orders of magnitude compared to digital computing, while mixed-precision inference requires even lower energy consumption than uniform-precision inference. Overall, these results establish analog RF computing over wireless networks as a promising paradigm for energy-efficient edge inference.
comment: 13 pages, 6 figures, 2 tables. This paper proposes analog RF computing as a new paradigm for energy-efficient edge inference over wireless networks and studies the corresponding physical layer design framework
☆ AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction
Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.
☆ Dynamic Latent Routing
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
☆ Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables,which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component,expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain informed criteria and sets up monitoring variables into functional groups reflecting operational mechanisms such as throughput,latency,pressure,network activity,and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space carries out comparable predictive performance and furthermore preserves the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.
comment: 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation
☆ Guided Diffusion Sampling for Precipitation Forecast Interventions
Extreme precipitation causes severe societal and economic damage, and weather control has long been discussed as a potential mitigation strategy. However, to the best of our knowledge, perturbation-based interventions for weather control using data-driven weather forecasting models have not yet been explored. While adversarial attacks also generate perturbations that alter forecasts, they aim to exploit model artifacts and do not account for physical plausibility. In this paper, we propose a gradient-based guidance framework for precipitation-reduction interventions through diffusion sampling in diffusion-based weather forecasting models. Instead of directly perturbing atmospheric states, our method steers the diffusion sampling trajectory, enabling precipitation reduction while maintaining consistency with the atmospheric distribution. To assess physical plausibility, we evaluate from three perspectives: (i) vertical and variable-wise perturbation profiles, (ii) latent-space trajectory deviation, and (iii) cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate that our method achieves effective precipitation reduction while yielding more physically plausible interventions than adversarial perturbations.
comment: 12+7 pages, 7+2 figures
☆ Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
comment: 28 pages including appendix. Code and BBBench benchmark to be released
☆ ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
☆ Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).
☆ Language-Induced Priors for Domain Adaptation
Domain adaptation faces a fundamental paradox in the cold-start regime. When target data is scarce, statistical methods fail to distinguish relevant source domains from irrelevant ones, which often leads to negative transfer. In this paper, we address this challenge by leveraging expert textual descriptions of the target domain, a resource that is often available but overlooked. We propose a probabilistic framework that translates these semantic descriptions into a choice model, namely a Language-Induced Prior (LIP), that learns the preferences from a pretrained Large Language Model (LLM). The LIP is then integrated into an Expectation-Maximization algorithm to identify source relevance. Methodologically, this framework is compatible with any parametric model where a likelihood is available. It allows the LIP to guide the selection of sources when target signals are weak, while gradually refining these choices as samples accumulate. Theoretically, we prove that the estimator roughly matches an oracle cold-start MSE under a correct prior, while remaining asymptotically consistent regardless of the quality of the LIP. Empirically, we validated the framework on a descriptive (Gaussian estimation), a predictive (C-MAPSS dataset), and a prescriptive task (MuJoCo hopper).
☆ Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients
We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
☆ Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.
comment: 32 pages, 6 figures, the full version of the paper accepted by CAV 2026
☆ Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor
KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ= 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
comment: 12 pages, 2 figures, 3 tables. Code and data: https://github.com/libophd/minimal-kv-retention
☆ To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.
☆ MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification ICML 2026
Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at https://github.com/ws-jiang/MetaMoE.
comment: Accepted by ICML 2026
☆ ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing
Data assimilation (DA) estimates the state of an evolving dynamical system from noisy, partial observations, and is widely used in scientific simulation as well as weather and climate science. In practice, filtering methods rely on frame-to-frame transition models. However, these models are fragile when observations are non-Markovian (when they form only a partial slice of a higher-dimensional latent state as in real-world weather data): they tend to accumulate errors over long horizons. At the same time, learned DA methods typically commit to a single regime, either filtering (nowcasting, real-time forecasting) or smoothing (retrospective reanalysis), which splits what should be a shared prior across application-specific pipelines. To address both issues, we introduce ForcingDAS, a unified and robust DA framework. Built on Diffusion Forcing with an independent noise level assigned to each frame, ForcingDAS learns a joint-trajectory prior instead of frame-to-frame transitions. This allows it to capture long-horizon temporal dependencies and reduce error accumulation. In addition, the same trained model spans the full filtering to smoothing spectrum at inference time. Specifically, nowcasting, fixed-lag smoothing, and batch reanalysis are selected through the inference schedule alone, without retraining. We evaluate ForcingDAS on 2D Navier-Stokes vorticity, precipitation nowcasting, and global atmospheric state estimation. Across all settings, a single model is competitive with or outperforms both learned and classical baselines that are specialized for individual regimes, with the largest gains observed on real-world weather benchmarks.
♻ ☆ Communication-Efficient Federated Fine-Tuning
Federated Learning (FL) enables the utilization of vast, previously inaccessible data sources. At the same time, pre-trained Language Models (LMs) have taken the world by storm and for good reason. They exhibit remarkable emergent abilities and are readily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent challenge in FL is the frequent, rigid communication of parameters -- a problem magnified by the sheer size of these contemporary models. The FedOpt family of algorithms has become the go-to approach for FL, relying on fixed but arbitrary intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic approach by monitoring the training progress. However, it introduced a hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this work, we address these limitations by proposing the FDA-Opt family of algorithms -- a unified generalization of both FDA and FedOpt. Our experimental evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates that FDA-Opt outperforms FedOpt even when it is configured with hyper-parameters specifically optimized for the latter. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.
♻ ☆ MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
Temporally contrastive representation learning induces a latent structure capable of reducing long-horizon planning to inference in a low-dimensional linear system. However, existing contrastive planning work learns a single latent geometry which cannot distinguish multiple valid behaviors trading task efficiency against risk exposure for the same start-goal query. We introduce MoMo, a preference-conditioned contrastive planner allowing a scalar user preference to continuously modulate plan conservativeness at inference time, without retraining. MoMo learns a joint conditioning of the representation geometry and latent prediction operator via Feature-Wise Linear Modulation and low-rank neural modulation, respectively. We show that our formulation preserves the probability density ratio encoded in the representation space that is required for inference-driven contrastive planning, further retaining its inference-time efficiency. Across six environments, MoMo smoothly adapts plan safety according to user preferences, yielding improved temporal and preferential consistency over state augmentation baselines.
♻ ☆ Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling
There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real-time latency-sensitive applications. In our work, we propose Speculative Interaction Agents to enable real-time interaction even for agents with complex multi-turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason-and-act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7$\times$ speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6-2.2$\times$ speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.
comment: 17 pages
♻ ☆ The Potential of Convolutional Neural Networks for Cancer Detection
Early detection is crucial for successful cancer treatment and increasing survivability rates, particularly in the most common forms. Ten different cancers have been identified in most of these advances that effectively use CNNs (Convolutional Neural Networks) for classification. The distinct architectures of CNNs used in each study concentrate on pattern recognition for different types of cancer across various datasets. The advantages and disadvantages of each approach are identified by comparing these architectures. This study explores the potential of integrating CNNs into clinical practice to complement traditional diagnostic methods. It also identifies the top-performing CNN architectures, highlighting their role in enhancing diagnostic capabilities in healthcare.
♻ ☆ Learning, Fast and Slow: Towards LLMs That Adapt Continually
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
comment: 29 pages, 14 figures, including appendix; Blog post: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
♻ ☆ Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control
Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical ``dense'' rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.
comment: IFAC world congress postprint
♻ ☆ Proxy Compression for Language Modeling ICML 2026
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.
comment: ICML 2026
♻ ☆ Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment
Many complex networks exhibit hierarchical, tree-like structures, making hyperbolic space a natural candidate wherein to learn representations of them. Based on this observation, Hyperbolic Graph Neural Networks (HGNNs) have been widely adopted as a principled choice for representation learning on tree-like graphs. In this work, we question this paradigm by proposing the additional condition of geometry--task alignment, i.e., whether the metric structure of the target follows that of the input graph. We theoretically and empirically demonstrate the capability of HGNNs to recover low-distortion representations on regression problems, and show that their geometric inductive bias becomes helpful when the problem requires preserving metric structure. By jointly analyzing predictive performance and embedding distortion, we further show that HGNNs gain an advantage on link prediction, a naturally geometry-aligned task, whereas this advantage largely disappears on standard node classification benchmarks, which are typically not geometry--aligned. Overall, our findings shift the focus from only asking "Is the graph hyperbolic?" to also questioning "Is the task aligned with hyperbolic geometry?", showing that HGNNs consistently outperform Euclidean models under such alignment, while their advantage vanishes otherwise.
♻ ☆ Conformal Prediction for Multimodal Regression
This paper introduces multimodal conformal regression. Traditionally confined to scenarios with solely numerical input features, conformal prediction is now extended to multimodal contexts through our methodology, which harnesses internal features from complex neural network architectures processing images and unstructured text. Our findings highlight the potential for internal neural network features, extracted from convergence points where multimodal information is combined, to be used by conformal prediction to construct prediction intervals (PIs). This capability paves new paths for deploying conformal prediction in domains abundant with multimodal data, enabling a broader range of problems to benefit from guaranteed distribution-free uncertainty quantification.
comment: Code available at https://github.com/ic-crc/uncertainty-estimation 20 pages, 34 figures
♻ ☆ Do-Undo Bench: Reversibility for Action Understanding in Image Generation
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
comment: Project page: https://s-mahajan.github.io/Do-Undo-Bench/
♻ ☆ BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions.BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
♻ ☆ Time Series Forecasting Through the Lens of Dynamics ICML 2026
While deep learning is facing an homogenization across modalities led by Transformers, they are still challenged by shallow linear models in the time series forecasting task. Our hypothesis is that models should learn a direct link from past to future data points, which we identify as a learning dynamics capability. We develop an original $\texttt{PRO-DYN}$ nomenclature to analyze existing models through the lens of dynamics. Two observations thus emerge: $\textbf{1.}$ under-performing architectures learn dynamics at most partially, $\textbf{2.}$ the location of the dynamics block at the model end is of prime importance. Our systemic and empirical studies both confirm our observations on a set of performance-varying models with diverse backbones. We propose a simple plug-and-play methodology guiding model designs and improvements.
comment: Accepted at ICML 2026
♻ ☆ Projected gradient methods for nonconvex and stochastic smooth optimization: new complexities and auto-conditioned stepsizes
We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the constant-stepsize PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.
♻ ☆ VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and codes can be found in https://yixiaowang7.github.io/ver_page/.
♻ ☆ TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
♻ ☆ Residual Stream Duality in Modern Transformer Architectures
Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
comment: Project Page: https://github.com/yifanzhang-pro/residual-stream-duality
♻ ☆ AVEX: What Matters for Animal Vocalization Encoding
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
comment: In The Fourteenth International Conference on Learning Representations 2026
♻ ☆ ClawGym: A Scalable Framework for Building Effective Claw Agents
Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources will be soon released at https://github.com/ClawGym.
♻ ☆ A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints
Nonsmooth composite optimization with orthogonality constraints has a wide range of applications in statistical learning and data science. However, this problem is challenging due to its nonsmooth objective and computationally expensive nonconvex constraints. In this paper, we propose a new approach called \textbf{OBCD}, which leverages block coordinate descent to address these challenges. \textbf{OBCD} is a feasible method with a small computational footprint. In each iteration, it updates \(k\) rows of the solution matrix, where \(k \geq 2\), by globally solving a small nonsmooth optimization problem under orthogonality constraints. We prove the completeness of the proposed update scheme, showing that row-wise orthogonal updates can reach any feasible point from any feasible initialization. We further prove that the limit points generated by \textbf{OBCD}, referred to as global block-\(k\) stationary points, offer stronger optimality than standard critical points. Furthermore, we show that \textbf{OBCD} finds an \(ε\)-block-\(k\) stationary point with an iteration complexity of \(\mathcal{O}(1/ε)\). Additionally, under the Kurdyka--Lojasiewicz (KL) inequality, we establish the non-ergodic convergence rate of \textbf{OBCD}. We also demonstrate how novel breakpoint search methods can be used to solve the subproblems arising in \textbf{OBCD}. Empirical results show that our approach consistently outperforms existing methods.
comment: Future versions of this paper can be found at arXiv:2304.03641
♻ ☆ Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically attenuate or discard delayed gradients, introducing systematic bias: updates from simpler or faster-to-process samples are overrepresented, while gradients from more complex samples are delayed or suppressed. In contrast, prior approaches to data-dependent delays rely on a Lipschitz assumption that yields suboptimal rates or leave the smooth, convex case unaddressed. We propose a momentum-based asynchronous framework designed to preserve information from delayed gradients while mitigating the effects of staleness. We establish the first optimal convergence rates for data-dependent delays in both convex and non-convex smooth setups, providing a new result for asynchronous optimization under standard assumptions. Additionally, we derive robust learning-rate schedules that simplify hyperparameter tuning in practice.
♻ ☆ Vendor-Conditioned Contrastive Learning for Predicting Organizational Cyber Threat Targets
Cyberattacks cause billions of dollars in damage annually, with malicious hackers often sharing exploit code and techniques on underground forums. Identifying which organizations are targeted by these exploits is critical for proactive Cyber Threat Intelligence (CTI). To address that gap, we propose Temporal Representation and Classification of Exploits (TRACE), a vendor-conditioned contrastive learning framework built on CySecBERT that jointly optimizes organizational target classification and vendor-coherent representations while evaluating robustness under temporal distribution shift. Unlike prior work limited to small, single-source datasets, we leverage a large-scale, multi-source corpus spanning 9 exploit databases and hacker forums, comprising 352,866 posts collected over three decades, yielding a 129,126-sample dataset across seven organizational categories. In the temporal out-of-distribution evaluation, TRACE achieves macro F1=97.00\%, substantially outperforming 17 benchmark classical ML methods, deep learning with GloVe/FastText embeddings, and pretrained transformer models.
comment: 6 pages, 3 figures
♻ ☆ ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning
Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations with three variants-self-attention, cross-attention, and hybrid-attention-that incorporates different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluation on benchmark problems spanning fluid dynamics, solid mechanics and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware saurrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.
comment: 69 pages, 21 figures, 10 tables
♻ ☆ On The Hidden Biases of Flow Matching Samplers
Flow matching (FM) constructs continuous-time ODE samplers by prescribing probability paths between a base distribution and a target distribution. In this note, we study FM through the lens of finite-sample plug-in estimation. In addition to replacing population expectations by sample averages, one may replace the target distribution itself by a finite-sample surrogate, ranging from the empirical measure to a smoothed estimator. This viewpoint yields a natural hierarchy of empirical FM models. For affine conditional flows, we derive the exact empirical minimizer and identify a smoothed plug-in regime in which the terminal law is exactly a kernel-mixture estimator. This plug-in perspective clarifies several coupled finite-sample biases of empirical FM. First, replacing the target law by a finite-sample surrogate changes the statistical target. Second, the empirical minimizer is generally not a gradient field, even when each conditional flow is. Third, a fixed empirical marginal path does not determine a unique particle dynamics: one may add extra vector fields whose probability flux has zero divergence without changing the marginal path. For Gaussian affine conditional paths, we give explicit families of such flux-null corrections. Finally, the source distribution provides a primary mechanism controlling upper tails of kinetic energy. In particular, Gaussian bases yield exponential upper-tail bounds for instantaneous and integrated kinetic energies, whereas polynomially tailed bases yield corresponding polynomial upper-tail bounds.
comment: 41 pages
♻ ☆ ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning ICML 2026
As large language models (LLMs) continue to scale in size, the computational overhead has become a major bottleneck for task-specific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution deals with these limitations by accumulating progressively a high-rank weight update from consecutive low-rank increments. Specifically, the per update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To endow efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.
comment: Accepted to ICML 2026
♻ ☆ Higher-order Linear Attention
The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures.
comment: Project Page: https://github.com/yifanzhang-pro/HLA
♻ ☆ Toward Privileged Foundation Models:LUPI for Accelerated and Improved Learning
Training foundation models is computationally intensive and often slow to converge. We introduce PIQL,Privileged Information for Quick and Quality Learning, the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect, reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
♻ ☆ Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation
We use hierarchical procedural rules for the generation of control maps within the stable diffusion framework to produce photo-realistic architectural facade images. Starting from a single input image and its segmentation, we apply an inverse procedural module to identify the facade's hierarchical layout. Leveraging this hierarchy and structural features, we introduce a novel ControlNet pipeline that generates new facade imagery guided by procedural transformations. Our method enables various structural edits, including floor duplication and window rearrangement, by integrating hierarchical alignment directly into control maps. This precisely guides the diffusion-based generative process, ensuring local appearance fidelity alongside extensive structural modifications. Comprehensive evaluations, including comparisons with inpainting-based approaches and synthetic benchmarks, confirm our approach's superior capability in preserving architectural identity and achieving accurate, controllable edits. Quantitative results and user feedback validate our method's effectiveness.
comment: 17 pages, 15 figures, Computer Graphics Forum 2026 Journal Paper
♻ ☆ Evolutionary Ensemble of Agents
We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the "LLMs as optimizers" paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen "best-evolved" agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.
♻ ☆ Change of measure through the Legendre transform
PAC-Bayes generalisation bounds are derived via change-of-measure inequalities that transfer concentration properties from a reference measure to all posterior measures. The specific choice of change of measure determines the assumptions required on the empirical risk; in particular, the classical Donsker--Varadhan theorem leads to bounds relying on bounded exponential moments. We study change-of-measure inequalities based on \(f\)-divergences, obtained by combining the Legendre transform of \(f\) with the Fenchel--Young inequality. Beyond their intrinsic interest in probability theory, we show how these inequalities are helpful in learning theory and yield PAC-Bayes bounds under tailored assumptions on the empirical risk, thereby extending the range of conditions under which PAC-Bayesian guarantees can be established.
comment: 27 pages
♻ ☆ Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
When labeled verifiable training data is scarce, each checked example should be used where it has the most value. A common approach is to train the deployment student model directly with sparse RL methods such as GRPO. We argue that this is often inefficient. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. In this view, GRPO-style sparse RL and OPD-style distillation are not competing methods, but two reward-density regimes used at different stages. We evaluate this rule on verifiable math tasks with Qwen3 and Llama models. For a fixed Qwen3-1.7B deployment student, distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the student with the same labeled data. In contrast, distilling from the same teacher before RL gives weaker results. The transfer bridge is also important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH before any later student-side sparse RL, and gives the strongest pre-Stage 3 AIME results for the canonical 8B and 14B teachers. Finally, the bridge makes later student-side RL more effective. GRPO is weak when applied to a cold student, but after the bridge it raises MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points. Overall, the lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the student has been bridged.
♻ ☆ Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
♻ ☆ A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization
In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $O(ST^{-\frac{1}{2})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives, and $T$ is the total iterations. Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.
♻ ☆ CAKE: Confidence in Assignments via K-partition Ensembles
Clustering is widely used for unsupervised structure discovery, yet it offers limited insight into how reliable each individual assignment is. Diagnostics, such as convergence behavior or objective values, may reflect global quality, but they do not indicate whether particular instances are assigned confidently, especially for initialization-sensitive algorithms like k-means. This assignment-level instability can undermine both accuracy and robustness. Ensemble approaches improve global consistency by aggregating multiple runs, but they typically lack tools for quantifying pointwise confidence in a way that combines cross-run agreement with geometric support from the learned cluster structure. This work introduces CAKE (Confidence in Assignments via K-partition Ensembles), a framework that evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability and consistency of local geometric fit. These are combined into a single, interpretable score in [0,1]. The theoretical analysis shows that CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate that CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking over instances that can be used for selection or prioritization in downstream clustering workflows.
comment: 37 pages, including appendix
♻ ☆ Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy
Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training-free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low-data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval-to-generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero-shot class-conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis-corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training-free score function retained family-level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
comment: Main body (including references excluding the appendix): 11 pages, 2 figures and 1 table. Total paper: 26 pages, 13 figures and 7 pages
♻ ☆ The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.
comment: 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap
♻ ☆ Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models
Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances for post-training quantization have demonstrated that even near 4-bit methods can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching full-precision parameters, following a weight-driven criterion. However, this objective is not directly aligned with the quantized model's objective, which is to preserve the model's output behavior under the impact of quantization. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs using calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, \emph{anisotropic distortion} of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.
♻ ☆ Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images show object duplication artifacts consistently. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient and achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows the applicability of our method to be general, where we interpolate fully-connected layers, going beyond convolution layers. We also discuss how we can reduce the memory footprints of training neural networks, using our method up to at least $4\times$.
comment: Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3
♻ ☆ Learning to Advect: A Neural Semi-Lagrangian Architecture for Weather Forecasting
Recent machine-learning approaches to weather forecasting often employ a monolithic architecture in which distinct physical mechanisms-advection (long-range transport), diffusion-like mixing, thermodynamic processes, and forcing-are represented implicitly within a single large network. This is particularly problematic for advection, where long-range transport typically requires expensive global interaction mechanisms or deep stacks of local convolutional layers. To mitigate this, we present PARADIS, a physics-inspired global weather prediction model that enforces inductive biases on network behavior through a functional decomposition into advection, diffusion, and reaction blocks acting on latent variables. We implement advection through a Neural Semi-Lagrangian operator that performs trajectory-based transport via differentiable interpolation on the sphere, enabling end-to-end learning of both the latent modes to be transported and their characteristic trajectories. Diffusion-like processes are modeled by depthwise-separable spatial mixing, whereas local source terms and vertical interactions are handled via pointwise channel interactions, yielding a physically structured operator decomposition. Evaluated on ERA5 benchmarks, PARADIS achieves competitive deterministic forecast skill, with particularly strong short-lead performance, while preserving substantially better spectral fidelity and forecast activity during medium-range rollouts.
♻ ☆ LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
♻ ☆ Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
♻ ☆ On Using the Shapley Value for Anomaly Localization: A Statistical Investigation
Recent publications have suggested using the Shapley value for anomaly localization for sensor data systems. Using a reasonable mathematical anomaly model for full control, experiments indicate that using a single fixed term in the Shapley value calculation achieves a lower complexity anomaly localization test, with the same probability of error, as a test using the Shapley value for all cases tested. A proof demonstrates these conclusions must be true for all independent observation cases. For dependent observation cases, no proof is available.
♻ ☆ Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection ICML 2026
Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.
comment: Accepted by ICML 2026
♻ ☆ Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review
Generating synthetic tabular health data is challenging, and evaluating their quality is equally, if not more, complex. This systematic review highlights the critical importance of rigorous evaluation of synthetic health data to ensure reliability, clinical relevance, and appropriate use. From an initial identification of 2067 relevant papers published in the last ten years, 134 studies were selected for detailed analysis. Our review identifies key challenges, including lack of consensus on evaluation methods, inconsistent application of evaluation metrics, limited involvement of domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide a structured consolidation of synthetic data generation and evaluation methods into taxonomies, alongside practical guidelines to support more robust and standardised evaluation practices. These findings aim to support the responsible development and use of synthetic health data, aligned with emerging expectations around transparency, reproducibility, and governance, ultimately enabling the community to fully harness its transformative potential and accelerate innovation.
comment: 32 pages
♻ ☆ Complex normalizing flows can almost be information Kähler-Ricci flows
We develop interconnections between the complex normalizing flow for data drawn from Borel probability measures on the twofold realification of the complex manifold and a nonlinear flow nearly Kähler-Ricci. The complex normalizing flow relates the initial and target realified densities under the complex change of variables, necessitating the log determinant of the ensemble of Wirtinger Jacobians. The Ricci curvature of a Kähler manifold is the second order mixed Wirtinger partial derivative of the log of the local density of the volume form. Therefore, we reconcile these two facts by drawing forth the connection that the log determinant used in the complex normalizing flow matches a Ricci curvature term under differentiation and conditions. The log density under the normalizing flow is kindred to a spatial Fisher information metric under an augmented Jacobian and a Bayesian perspective to the parameter, thus under the continuum limit the log likelihood matches a Fisher metric, recovering a Kähler-Ricci flow variation up to a time derivative and expectation, or an average-valued Kähler-Einstein flow. Using this framework, we establish other relevant results, attempting to bridge the statistical and ordinary behaviors of the complex normalizing flow to the geometric features of our derived Kähler flow.
♻ ☆ Watermarking Should Be Treated as a Monitoring Primitive
Watermarking is widely proposed for provenance, attribution, and safety monitoring in generative models, yet is typically evaluated only under adversaries who attempt to evade detection or induce false positives at the level of individual samples. We argue that watermarking should be treated as a monitoring primitive, and that internal monitoring is unavoidable given per-entity attribution keys and messages, as well as detector access. We introduce an observer-based threat model in which observers can aggregate watermark signals across outputs to infer entity-level information, showing that even zero-bit watermarking enables attribution under multi-key settings. We further show that external monitoring can emerge over time from persistent, key-dependent statistical structure, although this depends on watermark design and may be mitigated by distribution-preserving or undetectable schemes. Our findings reveal a fundamental dual-use tension between attribution and monitoring, motivating evaluation of watermarking beyond per-sample robustness to account for aggregation and observer-based capabilities.
comment: 12 pages, 5 figures
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.
comment: 30 pages, 16 figures
♻ ☆ Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models
Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM dedicated to forecasting tasks, which decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos .
♻ ☆ Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyzed this limitation by investigating convergence properties of BC-regularized actor-critic optimization and verified it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.
♻ ☆ Generative Bayesian Optimization: Generative Models as Acquisition Functions ICLR 2026
We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large batch scaling as generative sampling, optimization of non-continuous design spaces, and high-dimensional and combinatorial design. Inspired by the success of direct preference optimization (DPO), we show that one can train a generative model with noisy, simple utility values directly computed from observations to then form proposal distributions whose densities are proportional to the expected utility, i.e., BO's acquisition function values. Furthermore, this approach is generalizable beyond preference-based feedback to general types of reward signals and loss functions. This perspective avoids the construction of surrogate (regression or classification) models, common in previous methods that have used generative models for black-box optimization. Theoretically, we show that the generative models within the BO process follow a sequence of distributions which asymptotically approximate an optimal target under certain conditions. We also evaluate the performance through experiments on challenging optimization problems involving large batches in high dimensions.
comment: Published at ICLR 2026. Compared with the proceedings version on OpenReview, this version includes a minor revision to Section 3
♻ ☆ Gradient Iterated Temporal-Difference Learning
Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
♻ ☆ Unsupervised simulation of incompressible flows with physics- and equality- constrained artificial neural networks
Physics-informed neural networks (PINNs) have shown promise for solving partial differential equations, yet their success in simulating incompressible flows at high Reynolds numbers remains limited. Existing approaches rely on auxiliary labeled data, supervised pretraining, or reference solutions, and no purely unsupervised method comparable to conventional finite-difference or finite-volume solvers has been demonstrated. We attribute this gap to the absence of a mechanism for enforcing the divergence-free constraint and boundary conditions to strict tolerances. To address this, we adopt the physics- and equality-constrained artificial neural network (PECANN) framework with a conditionally adaptive augmented Lagrangian method (CA-ALM), and introduce a pressure-Poisson-based objective. The residual of the pressure Poisson equation is minimized subject to the momentum and continuity equations and boundary conditions on the primitive variables as equality constraints, with CA-ALM enforcing all constraints tightly. For advection-dominated, high-Reynolds-number flows, we further propose an adaptive vanishing entropy viscosity that stabilizes early training without influencing the converged solution. A baseline that instead uses the momentum residual as the objective proves ineffective under the same machinery, underscoring the critical role of the pressure-Poisson objective. The method is assessed on lid-driven cavity flow up to $Re=7{,}500$, three-dimensional unsteady Beltrami flow, and steady and unsteady flow past a circular cylinder with general inflow-outflow boundary conditions, including an ablation study identifying admissible outlet conditions -- all without labeled data or supervised pretraining. Notably, it captures the spontaneous onset of periodic vortex shedding in unsteady cylinder flow without external perturbations, starting from a randomly initialized network.
comment: 33 pages, 19 figures
♻ ☆ DMAP: A Distribution Map for Text ICLR 2026
Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
comment: ICLR 2026
♻ ☆ Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: oversized response length results in computational waste on semantically meaningless padding tokens, while undersized response length causes output truncation requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with D-LLM. At its core lies an Adaptive Response Length Predictor (AdaRLP), which estimates the optimal response length given an input query. As a measure against under-estimating the response length and re-running inference with a higher value, we introduce a data-driven safety mechanism based on a small increase of the predicted length. As a whole, our framework avoids wasting computation on padding tokens, at the same time preserving output quality. Experimental validation on multiple datasets demonstrates that Predict-then-Diffuse significantly reduces computational costs (FLOP) compared to the default D-LLM inference mechanism, while being robust to skewed data distributions.
♻ ☆ Vision-LLMs for Spatiotemporal Traffic Forecasting
Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While large language models have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending large language models to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of large language models in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with supervised fine-tuning and then further optimized for predictive accuracy using group relative policy optimization, a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the best baseline by around 30% on average in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.
♻ ☆ PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection ICLR 2026
Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.
comment: Accepted by the 14th International Conference on Learning Representations (ICLR 2026)
♻ ☆ R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability
Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
♻ ☆ Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals
Large Language Model ensembles improve reasoning accuracy, but only up to a performance boundary beyond which additional deliberation degrades accuracy. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. We make three contributions. (1) DASE produces a commit-type routing partition that generalises across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended (N=546, 70B ensemble), the partition yields a 39.5 pp routing gap (right-wall 81.1% vs. left-wall 41.5%). On AIME 2010-2023 (N=261, 120B ensemble, 3 seeds), right-wall commits reach 98.3% accuracy vs. left-wall 72.8% (25.5 pp gap), statistically equivalent to Opus 4.6 Standard verbalized confidence at matched coverage (25.7 pp gap; bootstrap p=0.873); the two mechanisms disagree on 37% of routing assignments. (2) Adaptive stopping, not injection bandwidth, drives accuracy. On AIME-300, bandwidth accounts for only 0.3 pp (ns). On GPQA-Extended at the 120B tier, sparse injection ($\approx15$ tokens/worker/round) achieves 70.9% with a 30.7 pp routing gap; dense injection ($\approx600$ chars/worker/round) achieves 72.2% but with halved right-wall coverage and a narrower 18.9 pp gap. (3) Injection-based methods exhibit an inverted-U accuracy-vs-inference trajectory; this pattern is hypothesis-generating.
♻ ☆ Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that are not captured by coarse-grained benchmarking, highlighting the importance of hardware-aware evaluation for long-context inference.
comment: 9 pages, 5 figures, 6 tables
♻ ☆ On the Identifiability of Causal Graphs with the Invariance Principle ICLR 2026
Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution {induced} by a structural causal model, and additional data from (in the best case) \textit{only two} environments that sufficiently differ in the noise statistics, the unique causal graph is identifiable. Notably, this is the first result in the literature that guarantees the entire causal graph recovery with a constant number of environments and arbitrary nonlinear mechanisms. Our only constraint is the Gaussianity of the noise terms; however, we propose potential ways to relax this requirement. Of interest on its own, we expand on the well-known duality between independent component analysis (ICA) and causal discovery; recent advancements have shown that nonlinear ICA can be solved from multiple environments, at least as many as the number of sources: we show that the same can be achieved for causal discovery while having access to much less auxiliary information.
comment: Published as ICLR 2026 conference paper
♻ ☆ MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing
Graph Neural Networks (GNNs) suffer from over-squashing in deep message passing, where information from exponentially growing neighborhoods is compressed into fixed-dimensional representations. We show that this issue becomes a distinct failure mode in multi-label graphs: neighboring nodes often share only limited labels while differing across many irrelevant ones, causing predictive signals to be diluted by noisy label information. To address this challenge, we propose the Multi-Label Graph Information Bottleneck (MLGIB), which formulates multi-label message passing as constrained information transmission under irrelevant label noise. MLGIB balances expressiveness and robustness by preserving predictive label signals while suppressing irrelevant noise. Specifically, it constructs a Markovian dependence space and derives tractable variational bounds, where the lower bound maximizes mutual information with target labels and the upper bound constrains redundant source information. These bounds lead to an end-to-end label-aware message-passing architecture. Extensive experiments on multiple benchmarks demonstrate consistent improvements over existing methods, validating the effectiveness and generality of the proposed framework.
♻ ☆ Performance Guarantees for Quantum Neural Estimation of Entropies
Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rényi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|Θ(\mathcal{U})|d/ε^2)$ for QNE with a quantum circuit parameter set $Θ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ε$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.
comment: 43 pages
♻ ☆ ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data
Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at: \href{https://github.com/santanurathod/ContextFlow}{ContextFlow}
comment: 42 pages, 21 figures, 30 tables
♻ ☆ BOOST: A Data-Driven Framework for the Automated Joint Selection of Kernel and Acquisition Functions in Bayesian Optimization
The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a significant practical challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions and acquisition functions have been actively explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been largely overlooked. This forced practitioners to rely on heuristics or costly manual training. In this work, we propose a framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a simple offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most promising pair before committing to the expensive evaluation process. BOOST is a data-driven strategy selection procedure that evaluates kernel-acquisition pairs based on their empirical performance on the data-in-hand. At each iteration, previously observed points are partitioned into a reference set and a query set. These subsets play roles analogous to training and validation sets in machine learning: the reference set is used for model construction, while the query set represents unseen regions to retrospectively evaluate how effectively each candidate strategy progresses toward the target value. Experiments on synthetic benchmarks and machine learning hyperparameter optimization tasks demonstrate that BOOST consistently improves over fixed-hyperparameter BO and remains competitive with state-of-the-art adaptive methods, highlighting its robustness across diverse landscapes.
comment: 25 pages
♻ ☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
♻ ☆ Frequency-adaptive tensor neural networks for high-dimensional multi-scale problems
Tensor neural networks (TNNs) have demonstrated their superiority in solving high-dimensional problems. However, similar to conventional neural networks, TNNs are also influenced by the Frequency Principle, which limits their ability to accurately capture high-frequency features of the solution. In this work, we analyze the training dynamics of TNNs by Fourier analysis and enhance their expressivity for high-dimensional multi-scale problems by incorporating random Fourier features. Leveraging the inherent tensor structure of TNNs, we further propose a novel approach to extract frequency features of high-dimensional functions by performing the Discrete Fourier Transform to one-dimensional component functions. This strategy effectively mitigates the curse of dimensionality. Building on this idea, we propose a frequency-adaptive TNNs algorithm, which significantly improves the ability of TNNs in solving complex multi-scale problems. Extensive numerical experiments are performed to validate the effectiveness and robustness of the proposed frequency-adaptive TNNs algorithm.
♻ ☆ SplineFlow: Flow Matching for Dynamical Systems with B-Spline Interpolants
Flow matching is a scalable generative framework for characterizing continuous normalizing flows with wide-range applications. However, current state-of-the-art methods are not well-suited for modeling dynamical systems, as they construct conditional paths using linear interpolants that may not capture the underlying state evolution, especially when learning higher-order dynamics from irregular sampled observations. Constructing unified paths that satisfy multi-marginal constraints across observations is challenging, since naïve higher-order polynomials tend to be unstable and oscillatory. We introduce SplineFlow, a theoretically grounded flow matching algorithm that jointly models conditional paths across observations via B-spline interpolation. Specifically, SplineFlow exploits the smoothness and stability of B-spline bases to learn the complex underlying dynamics in a structured manner while ensuring the multi-marginal requirements are met. Comprehensive experiments across various deterministic and stochastic dynamical systems of varying complexity, as well as on cellular trajectory inference tasks, demonstrate the strong improvement of SplineFlow over existing baselines. Our code is available at: https://github.com/santanurathod/SplineFlow.
comment: 36 pages, 35 tables, 22 figures
♻ ☆ High-arity Sample Compression
Recently, a series of works have started studying variations of concepts from learning theory for product spaces, which can be collected under the name high-arity learning theory. In this work, we consider a high-arity variant of sample compression schemes and we prove that the existence of a high-arity sample compression scheme of non-trivial quality implies high-arity PAC learnability.
comment: 29 pages v2: corrected minor typos
♻ ☆ A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights
Convex clustering is a well-regarded clustering method, resembling the similar centroid-based approach of Lloyd's $k$-means, without requiring a predefined cluster count. It starts with each data point as its centroid and iteratively merges them. Despite its advantages, this method can fail when dealing with data exhibiting linearly non-separable or non-convex structures. To mitigate the limitations, we propose a kernelized extension of the convex clustering method. This approach projects the data points into a Reproducing Kernel Hilbert Space (RKHS) using a feature map, enabling convex clustering in this transformed space. This kernelization not only allows for better handling of complex data distributions but also produces an embedding in a finite-dimensional vector space. We provide a comprehensive theoretical underpinning for our kernelized approach, proving algorithmic convergence and establishing finite sample bounds for our estimates. The effectiveness of our method is demonstrated through extensive experiments on both synthetic and real-world datasets, showing superior performance compared to state-of-the-art clustering techniques. This work marks a significant advancement in the field, offering an effective solution for clustering in non-linear and non-convex data scenarios.
♻ ☆ LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space.We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.
♻ ☆ Contraction and Hourglass Persistence for Learning on Graphs, Simplices, and Cells ICLR 2026
Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms. We also discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at \href{https://github.com/Aalto-QuML/Hourglass}{this https URL}.
comment: 31 pages, 6 figures, 4 algorithms, 2 tables. Accepted at ICLR 2026
♻ ☆ Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors
GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node's prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of >20% and a CA drop of >3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (>80%), which incurs a >10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.
♻ ☆ DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series
Modeling uncertainty in heavy-tailed time series remains a critical challenge for deep probabilistic forecasting models, which often struggle to capture abrupt, extreme events. While Lévy stable distributions offer a natural framework for modeling such non-Gaussian behaviors, the intractability of their probability density functions severely limits conventional likelihood-based inference. To address this, we introduce DeepLévy, a neural framework that learns mixtures of Lévy stable distributions by minimizing the discrepancy between empirical and parametric characteristic functions. DeepLévy incorporates a mixture mechanism that adaptively learns context-dependent weights and parameters over multiple Lévy components, enabling flexible multi-horizon uncertainty modeling. Evaluations on both real and synthetic datasets demonstrate that DeepLévy outperforms state-of-the-art deep probabilistic forecasting approaches in tail risk metrics, especially under extreme volatility.
♻ ☆ Manifold Dimension Estimation via Local Graph Structure
Most existing manifold dimension estimators rely on the assumption that the underlying manifold is locally flat within the neighborhoods under consideration. More recently, curvature-adjusted principal component analysis (CA-PCA) has emerged as a powerful alternative by explicitly accounting for the manifold's curvature. Motivated by these ideas, we propose a manifold dimension estimation framework that captures the local graph structure of the manifold through regression on local PCA coordinates. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art approaches.
♻ ☆ Generalizing Score-based generative models for Heavy-tailed Distributions
Score-based generative models (SGMs) have achieved remarkable empirical success, motivating their application to a broad range of data distributions. However, extending them to heavy-tailed targets remains a largely open problem. Although dedicated models for heavy-tailed distributions have been proposed, their generative fidelity remains unclear and they lack solid theoretical foundations, leaving important questions open in this regime. In this paper, we address this gap through two theoretical contributions. First, we show that combining early stopping with a suitable initialization is sufficient to extend the diffusion framework to any target distribution; in particular, we establish the well-posedness of the backward process and prove convergence of the approximated diffusion in KL divergence. Second, we derive novel theoretical guarantees for generation with normalizing flows, obtaining convergence results that hold under mild conditions on the flow family and without any assumption on the tail behavior of the target distribution. Building on these results, we propose a unified generative framework for heavy-tailed distributions: a normalizing flow is first trained to capture the tail behavior and is then used as an initialization prior for an SGM, which refines the samples by recovering fine-grained structural details. This design leverages the complementary strengths of the two model classes within a theoretically principled pipeline, overcoming the limitations of existing approaches.
♻ ☆ Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) represents another canonical class of bandit algorithms, well known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. This makes indices constructed from purely online and hybrid data difficult to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The median anchoring systematically corrects bias induced by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and quantifying how the degree of distribution shift and the size of offline data affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
♻ ☆ How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
♻ ☆ MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.
♻ ☆ AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers IEEE
Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
comment: Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TALSP). Copyright IEEE
♻ ☆ Realizable Bayes-Consistency for General Metric Losses ICML 2026
We study strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond $0$-$1$ classification (Bousquet et al., 2020; Hanneke et al., 2021) and real-valued regression (Attias et al., 2024). Given an instance space $(X,ρ)$, a label space $(Y,\ell)$ with possibly unbounded loss, and a hypothesis class $H \subseteq Y^{X}$, we resolve the realizable case of an open problem presented in Tsir Cohen and Kontorovich (2022). Specifically, we find the necessary and sufficient conditions on the hypothesis class $H$ under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: Similarly to Attias et al. (2024), we introduce the notion of an infinite non-decreasing $(γ_k)$-Littlestone tree, where $γ_k \to \infty$. This extends the Littlestone tree structure used in Bousquet et al. (2020) to the metric loss setting.
comment: 14 pages. To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026); v2: fixed abstract metadata rendering
♻ ☆ NeuroMambaLLM: Dynamic Graph Learning of fMRI Functional Connectivity in Autistic Brains Using Mamba and Language Model Reasoning
Large Language Models (LLMs) have demonstrated strong semantic reasoning across multimodal domains. However, their integration with graph-based models of brain connectivity remains limited. In addition, most existing fMRI analysis methods rely on static Functional Connectivity (FC) representations, which obscure transient neural dynamics critical for neurodevelopmental disorders such as autism. Recent state-space approaches, including Mamba, model temporal structure efficiently, but are typically used as standalone feature extractors without explicit high-level reasoning. We propose NeuroMambaLLM, an end-to-end framework that integrates dynamic latent graph learning and selective state-space temporal modelling with LLMs. The proposed method learns the functional connectivity dynamically from raw Blood-Oxygen-Level-Dependent (BOLD) time series, replacing fixed correlation graphs with adaptive latent connectivity while suppressing motion-related artifacts and capturing long-range temporal dependencies. The resulting dynamic brain representations are projected into the embedding space of an LLM model, where the base language model remains frozen and lightweight low-rank adaptation (LoRA) modules are trained for parameter-efficient alignment. This design enables the LLM to perform both diagnostic classification and language-based reasoning, allowing it to analyze dynamic fMRI patterns and generate clinically meaningful textual reports.
♻ ☆ Causal Time Series Generation via Diffusion Models
Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG, conditional models generate sequences given observed covariates, however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity and also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.
♻ ☆ The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval
This paper introduces The Spheres dataset, multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset is composed of over one hour recordings of musical pieces performed by the Colibrì Ensemble at The Spheres recording studio, capturing two canonical works - Tchaikovsky's Romeo and Juliet and Mozart's Symphony No. 40 - along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleeding and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering valuable acoustic characterization of the recording space. We present the dataset structure, acoustic analysis, and baseline evaluations using X-UMX based models for orchestral family separation and microphone debleeding. Results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset's value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.
♻ ☆ BioSEN: A Bio-acoustic Signal Enhancement Network for Animal Vocalizations
Most work in audio enhancement targets human speech, while bioacoustics is less studied due to noisy recordings and the distinct traits of animal sounds. To fill this gap, we adapt speech enhancement methods and build BioSEN, a model made for bioacoustic signals. BioSEN has three modules: a multi-scale dual-axis attention unit for time-frequency feature extraction, a bio-harmonic multi-scale enhancement unit for capturing harmonic structures, and an energy-adaptive gating connection unit that uses frequency weights to keep vocalizations from being removed as noise. Tests on three bioacoustic datasets show that BioSEN matches or exceeds state-of-the-art speech enhancement models while using far less computation. These results show BioSEN's strength for bioacoustic audio enhancement and its promise for biodiversity monitoring and conservation.
♻ ☆ L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on an OLMoE-based language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing geometry, expert discrimination, and overall model performance. Code will be released.
♻ ☆ DRIFT: A Benchmark for Task-Free Continual Graph Learning with Continuous Distribution Shifts
Continual graph learning (CGL) aims to learn from dynamically evolving graphs while mitigating catastrophic forgetting. Existing CGL approaches typically adopt a task-based formulation, where the data stream is partitioned into a sequence of discrete tasks with pre-defined boundaries. However, such assumptions rarely hold in real-world environments, where data distributions evolve continuously and task identity is often unavailable. To better reflect realistic non-stationary environments, we revisit continual graph learning from a task-free perspective. We propose a unified formulation that models the data stream as a time-varying mixture of latent task distributions, enabling continuous modeling of distribution drift. Based on this formulation, we construct \emph{DRIFT}, a benchmark that spans a spectrum of transition dynamics ranging from hard task switches to smooth distributional drift through a Gaussian parameterization. We evaluate representative continual learning methods under this task-free setting and observe substantial performance degradation compared to traditional task-based protocols. Our findings indicate that many existing approaches implicitly rely on task boundary information and struggle under realistic task-free graph streams. This work highlights the importance of studying continual graph learning under realistic non-stationary conditions and provides a benchmark for future research in this direction. Our code is available at https://github.com/UConn-DSIS/DRIFT.
comment: 20 pages, 5 figures
♻ ☆ MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .
comment: Some fundemental change in text and codebase. Will request a new submission later on
♻ ☆ Boosting LLM Reasoning via Human-Inspired Reward Shaping
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts the natural learning behavior of human learners. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus instead on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T(Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across 5 mainstream LLMs demonstrate that T2T significantly outperforms standard GRPO and recent baselines, achieving superior performance.
♻ ☆ Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration OSDI 2026
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.
comment: OSDI 2026
♻ ☆ Scalable Subset Selection in Linear Mixed Models
Linear mixed models (LMMs), which incorporate fixed and random effects, are key tools for analyzing heterogeneous data, such as in personalized medicine. Nowadays, this type of data is increasingly wide, sometimes containing thousands of candidate predictors, necessitating sparsity for prediction and interpretation. However, existing sparse learning methods for LMMs do not scale well beyond tens or hundreds of predictors, leaving a large gap compared with sparse methods for linear models, which ignore random effects. This paper closes the gap with a new $\ell_0$ regularized method for LMM subset selection that can run on datasets containing thousands of predictors in seconds to minutes. On the computational front, we develop a coordinate descent algorithm as our main workhorse and provide a guarantee of its convergence. We also develop a local search algorithm to help traverse the nonconvex optimization surface. Both algorithms readily extend to subset selection in generalized LMMs via a penalized quasi-likelihood approximation. On the statistical front, we provide a finite-sample bound on the Kullback-Leibler divergence of the new method. We then demonstrate its excellent performance in experiments involving synthetic and real datasets.
♻ ☆ AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution ICLR 2026
Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches implicitly or explicitly incorporating assistant distribution have recently been proposed. However, the past proposals of assistant distributions have been a fragmented approach without a systematic investigation of the interpolation path and the divergence. This paper proposes $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space. We release the code at https://github.com/aailab-kaist/AMiD.
comment: The Fourteenth International Conference on Learning Representations (ICLR 2026)
♻ ☆ Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning ICML 2026
Semi-supervised learning faces significant challenges in realistic scenarios where labeled data is scarce and unlabeled data follows unknown, arbitrary distributions. We formalize this critical yet under-explored paradigm as Universal Semi-supervised Learning (UniSSL). Existing methods typically leverage unlabeled data via pseudo-labeling. However, they often rely on the idealized assumption of a uniform unlabeled data distribution or require sufficient labeled data to estimate it. In the UniSSL setting, such dependencies lead to numerous erroneous pseudo-labels, thereby triggering representation confusion. Fortunately, we observe that inter-sample relations captured by representations are more reliable than pseudo-labels. Leveraging this insight, we shift our focus to representation-level structural inference to bypass distribution estimation. Accordingly, we propose Simplex Anchored Graph-state Equipartition (SAGE), which captures high-order inter-sample dependencies to establish structural consensus for guiding representation learning. Meanwhile, to mitigate representation confusion, we employ vectors that satisfy a simplex equiangular tight frame to serve as a coordinate frame for guiding inter-class representation separation. Finally, we introduce a weighting strategy based on distribution-agnostic metrics to prioritize reliable pseudo-labels and an auxiliary branch to isolate potentially erroneous pseudo-labels. Evaluations on five standard benchmarks show that SAGE consistently outperforms state-of-the-art methods, with an average accuracy gain of $\textbf{8.52\%}$.
comment: The paper is accepted by ICML 2026
♻ ☆ LEAP: Local ECT-Based Learnable Positional Encodings for Graphs ICLR
Graph neural networks (GNNs) largely rely on the message-passing paradigm, where nodes iteratively aggregate information from their neighbors. Yet, standard message passing neural networks (MPNNs) face well-documented theoretical and practical limitations. Graph positional encoding (PE) has emerged as a promising direction to address these limitations. The Euler Characteristic Transform (ECT) is an efficiently computable geometric-topological invariant that characterizes shapes and graphs. In this work, we combine the differentiable approximation of the ECT (DECT) and its local variant ($\ell$-ECT) to propose LEAP, a new end-to-end trainable local structural PE for graphs. We evaluate our approach on multiple real-world datasets as well as on a synthetic task designed to test its ability to extract topological features. Our results underline the potential of LEAP-based encodings as a powerful component for graph representation learning pipelines.
comment: Accepted at the International Conference on Learning Representations (ICLR) 2026. Our code is available https://www.github.com/aidos-lab/LEAP
♻ ☆ R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow SIGGRAPH 2026
Video-guided 3D animation holds immense potential for content creation, offering intuitive and precise control over dynamic assets. However, practical deployment faces a critical yet frequently overlooked hurdle: the pose misalignment dilemma. In real-world scenarios, the initial pose of a user-provided static mesh rarely aligns with the starting frame of a reference video. Naively forcing a mesh to follow a mismatched trajectory inevitably leads to severe geometric distortion or animation failure. To address this, we present Rectified Dynamic Mesh (R-DMesh), a unified framework designed to generate high-fidelity 4D meshes that are ``rectified'' to align with video context. Unlike standard motion transfer approaches, our method introduces a novel VAE that explicitly disentangles the input into a conditional base mesh, relative motion trajectories, and a crucial rectification jump offset. This offset is learned to automatically transform the arbitrary pose of the input mesh to match the video's initial state before animation begins. We process these components via a Triflow Attention mechanism, which leverages vertex-wise geometric features to modulate the three orthogonal flows, ensuring physical consistency and local rigidity during the rectification and animation process. For generation, we employ a Rectified Flow-based Diffusion Transformer conditioned on pre-trained video latents, effectively transferring rich spatio-temporal priors to the 3D domain. To support this task, we construct Video-RDMesh, a large-scale dataset of over 500k dynamic mesh sequences specifically curated to simulate pose misalignment. Extensive experiments demonstrate that R-DMesh not only solves the alignment problem but also enables robust downstream applications, including pose retargeting and holistic 4D generation.
comment: Accepted by SIGGRAPH 2026, Project Page: https://r-dmesh.github.io/ Code URL: https://github.com/Tencent-Hunyuan/R-DMesh
♻ ☆ Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning ICLR 2026
Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these ``natural'' subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to ``variables'' used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.
comment: Published as a conference paper at ICLR 2026
♻ ☆ Measuring the stability and plasticity of recommender systems
The typical offline protocol to evaluate recommendation algorithms is to collect a dataset of user-item interactions and then use a part of this dataset to train a model, and the remaining data to measure how closely the model recommendations match the observed user interactions. This protocol is straightforward, useful and practical, but it only provides snapshot performance. We know, however, that online systems evolve over time. In general, it is a good idea that models are frequently retrained with recent data. But if this is the case, to what extent can we trust previous evaluations? How will a model perform when a different pattern (re)emerges? In this paper we propose a methodology to study how recommendation models behave when they are retrained. The idea is to profile algorithms according to their ability to, on the one hand, retain past patterns - stability - and, on the other hand, (quickly) adapt to changes - plasticity. We devise an offline evaluation protocol that provides detail on the long-term behavior of models, and that is agnostic to datasets, algorithms and metrics. To illustrate the potential of this framework, we present preliminary results of three different types of algorithms on the GoodReads dataset that suggest different stability and plasticity profiles depending on the algorithmic technique, and a possible trade-off between stability and plasticity. We further discuss the potential and limitations of the proposal and advance some possible improvements.
comment: Final version published in the proceedings of ACM UMAP 2026: https://doi.org/10.1145/3774935.3812707
♻ ☆ Scalable Krylov Subspace Methods for Generalized Mixed-Effects Models with Crossed Random Effects
Mixed-effects models are widely used to model data with hierarchical grouping structures and high-cardinality categorical predictor variables. However, for high-dimensional crossed random effects, current standard computations relying on Cholesky decompositions can become prohibitively slow. In this work, we present Krylov subspace-based methods that address existing computational bottlenecks, and we analyze them both theoretically and empirically. In particular, we derive new results on the convergence and accuracy of the preconditioned stochastic Lanczos quadrature and conjugate gradient methods for mixed-effects models, and we develop scalable methods for calculating predictive variances. In experiments with simulated and real-world data, the proposed methods yield speedups by factors of up to about 10,000 and are numerically more stable than Cholesky-based computations.
♻ ☆ Closing the Gap on the Sample Complexity of 1-Identification
The 1-identification problem is a fundamental pure-exploration problem in multi-armed bandits. An agent aims to determine whether there exists an arm whose mean reward exceeds a known threshold $μ_0$, or to output \textsf{None} otherwise. The agent must guarantee correctness with probability at least $1-δ$, while minimizing the expected number of arm pulls $\mathbb{E}[τ]$. We study the 1-identification problem and make two main contributions. First, for instances with at least one qualified arm, we derive a new lower bound on $\mathbb{E}[τ]$ via a novel optimization formulation. Second, we propose a new algorithm and establish upper bounds that match the lower bounds up to polynomial logarithmic factors uniformly over all instances. Our result complements the analysis of $\mathbb{E}τ$ when there are multiple qualified arms, which is an open problem in the literature.
♻ ☆ ReNF: Rethinking the Design of Neural Long-Term Time Series Forecasters
Neural Forecasters (NFs) have become a cornerstone of Long-term Time Series Forecasting (LTSF). However, recent progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting structures. In this work, we revisit principled designs of LTSF. We begin by formulating a Variance Reduction Hypothesis (VRH), positing that generating and combining multiple forecasts is essential to reducing the inherent uncertainty of NFs. Guided by this, we propose Boosted Direct Output (BDO), a streamlined paradigm that synergistically hybridizes the causal structure of Auto-Regressive (AR) with the stability of Direct Output (DO), while implicitly realizing the principle of forecast combination within a single network. Furthermore, we mitigate a critical validation-test generalization gap by employing parameter smoothing to stabilize optimization. Extensive experiments demonstrate that these trivial yet principled improvements enable a direct temporal MLP to outperform recent, complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases. Finally, we empirically verify our hypothesis, establishing a dynamic performance bound that highlights promising directions for future research. The code is publicly available at: https://github.com/Luoauoa/ReNF.
♻ ☆ Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair ACL 2026
Gödel agent realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code pat ch repair with conservative checks. Unlike response level self correction or parameter tuning, Polaris makes policy level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.
comment: Accepted to ACL 2026 (Findings). 33 pages
♻ ☆ ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we introduce ECHO, a high concurrency-oriented framework integrated into SGLang that reformulates speculative execution as a budgeted scheduling problem. Crucially, ECHO employs sparse confidence gating to manage the batch as a unified super-tree, elastically pivoting budget between depth and width to co-optimize the trade-off between reducing global verification steps and maximizing per-step efficiency. Extensive evaluations across diverse model scales-particularly the industrial-grade Qwen3-235B-demonstrate that ECHO consistently outperforms SOTA methods in both low-load and high-load scenarios, achieving up to 5.35x walltime speedup and delivering over 20% relative speedup gain.
♻ ☆ Distributions as Actions: A Unified Framework for Diverse Action Spaces ICLR 2026
We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distributions-as-Actions Policy Gradient (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce Interpolated Critic Learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, Distributions-as-Actions Actor-Critic (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.
comment: Accepted to ICLR 2026 (camera-ready)
♻ ☆ FlowSteer: Towards Agents Designing Agentic Workflows via Reinforced Progressive Canvas Editing
In recent years, agentic workflows have been widely applied to solve complex human tasks. However, existing workflow construction still faces key challenges, including human-dependent workflow construction, the lack of graph-level execution feedback, and the inability to repair errors in-loop during long-horizon construction. To address these challenges, we propose FlowSteer, a new paradigm of Agent Designing Agentic Workflows - a single agent itself end-to-end designs the workflow that a downstream executor runs. To support this paradigm, we introduce the Workflow Canvas, a novel executable graph-state environment that returns syntax-checked execution feedback for every atomic edit. Built on the canvas, we further propose Reinforced Progressive Canvas Editing, in which a lightweight policy agent issues one atomic edit per turn conditioned on real canvas feedback, and is trained end-to-end via reinforcement learning. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks. Our code is available at https://anonymous.4open.science/r/FlowSteer-9B2E.
comment: 51 pages, 6 figures, 5 tables. Project page: http://flowsteer.org/
♻ ☆ OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
We investigate the capabilities and scalability of Large Language Models (LLMs) in optimization modeling, a domain requiring structured reasoning and precise formulation. To this end, we introduce OPT-ENGINE, an extensible benchmark framework with quantifiable and controllable complexity. OPT-ENGINE spans ten canonical Operations Research problems, systematically scaling from Linear Programming to Mixed-Integer Programming, providing a structured environment to probe the limits of automated problem formulation and solving. Utilizing OPT-Engine, we address three pivotal research questions. First, we examine whether Pure-Text Reasoning (PTR) via classical Chain-of-Thought can efficiently tackle optimization tasks, finding that PTR suffers from a critical robustness gap as task complexity increases. Second, we examine whether integrating external computational tools can mitigate PTR's arithmetic weaknesses and improve performance. Our results indicate that while such tools help with local calculations, they still fail to adhere to global optimization constraints. Finally, we pinpoint that for the current SOTA paradigm, Solver-integrated Reasoning (SIR), the automated formulation of constraints represents the primary bottleneck. These findings clarify the limitations of current paradigms and provide a structured roadmap for developing next-generation LLMs for optimization modeling. We release our code and data to facilitate future research (https://github.com/Cardinal-Operations/OPTEngine).
♻ ☆ A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
In-context learning (ICL) enables task adaptation at inference time by conditioning on demonstrations rather than updating model parameters. Although recent time-series foundation models incorporate contextual conditioning, retrieval, or example-based prompting, they typically rely on implicit positional structure or task-specific objectives rather than explicit instruction-conditioned input-output demonstrations. We introduce iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning to infer tasks directly from example demonstrations. iAmTime represents each episode as a structured prompt over historical context and future-known variables using specialized semantic tokens that attend to designated time-series regions, exchange information across demonstrations, and inject task information into the query representation. The model combines a Hierarchical Multi-Scope Transformer Encoder, which captures temporal and covariate dynamics while inferring latent task structure from demonstrated input-output mappings, with a Task-Conditioned Patch Decoder, which adapts decoding through expert-based routing. We train iAmTime on large-scale real and synthetic corpora using supervised and self-supervised instruction-conditioned tasks, including forecasting, imputation, reconstruction, classification, anomaly detection, and source de-mixing. Across diverse domains, frequencies, and horizons, iAmTime improves zero-shot adaptation over strong time-series foundation baselines on probabilistic and point forecasting benchmarks, while achieving competitive performance on non-forecasting tasks such as classification.
♻ ☆ Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry ICLR 2026
Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic complexity in the group size. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
comment: 33 pages, 2 figures. Published at ICLR 2026
♻ ☆ BRIDGE: Building Representations In Domain Guided Program Synthesis
Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.
comment: 41 pages, 10 figures, 3 tables. Preprint
♻ ☆ Distributional Principal Autoencoders
Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original data, irrespective of the retained dimension or the specific mapping. This can be achieved by learning a distributional model that matches the conditional distribution of data given its low-dimensional latent variables. Motivated by this, we propose Distributional Principal Autoencoder (DPA) that consists of an encoder that maps high-dimensional data to low-dimensional latent variables and a decoder that maps the latent variables back to the data space. For reducing the dimension, the DPA encoder aims to minimise the unexplained variability of the data with an adaptive choice of the latent dimension. For reconstructing data, the DPA decoder aims to match the conditional distribution of all data that are mapped to a certain latent value, thus ensuring that the reconstructed data retains the original data distribution. Our numerical results on climate data, single-cell data, and image benchmarks demonstrate the practical feasibility and success of the approach in reconstructing the original distribution of the data. DPA embeddings are shown to preserve meaningful structures of data such as the seasonal cycle for precipitations and cell types for gene expression.
♻ ☆ Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection
Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.
♻ ☆ ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery
Advances in large language models (LLMs) have recently opened new and promising avenues for small-molecule drug discovery. Yet existing LLM-based approaches for molecular generation often suffer from high rates of invalid and low-quality ligand candidates, a result of the syntactic limitations of current models with regard to molecular strings. In this paper, we introduce $\texttt{ToolMol}$, an evolutionary agentic framework for de novo drug design. $\texttt{ToolMol}$ combines a multi-objective genetic algorithm with an agentic LLM operator that iteratively updates the ligand population. We build a comprehensive toolbox of RDKit-backed functions that allows our agentic operator to consisently make precise ligand modifications. $\texttt{ToolMol}$ achieves state-of-the-art performance on multi-objective property optimization tasks, discovering drug-like and synthesizable ligands that have $>10\%$ stronger predicted binding affinity compared to existing methods, evaluated on three protein targets. $\texttt{ToolMol}$ ligands additionally achieve state-of-the-art results in gold-standard Absolute Binding Free Energy scores, gaining over existing methods by over $35\%$. By studying chain-of-thought reasoning traces, we observe that tool-calling enables the model to more faithfully execute its planned modifications, efficiently exploiting the strong chemical prior knowledge in LLMs.
comment: 9 pages, 5 figures
♻ ☆ Stochastic dynamics learning with state-space systems
This work advances the theoretical foundations of reservoir computing (RC) by providing a unified treatment of fading memory and the echo state property (ESP) in both deterministic and stochastic settings. We investigate state-space systems, a central model class in time series learning, and establish that fading memory and solution stability hold generically -- even in the absence of the ESP -- offering a robust explanation for the empirical success of RC models without strict contractivity conditions. In the stochastic case, we critically assess stochastic echo states, proposing a novel distributional perspective rooted in attractor dynamics on the space of probability distributions, which leads to a rich and coherent theory. Our results extend and generalize previous work on non-autonomous dynamical systems, offering new insights into causality, stability, and memory in RC models. This lays the groundwork for reliable generative modeling of temporal data in both deterministic and stochastic regimes.
♻ ☆ FreeMOCA: Memory-Free Continual Learning for Malicious Code Analysis
As over 200 million new malware samples are identified each year, antivirus systems must continuously adapt to the evolving threat landscape. However, retraining solely on new samples leads to catastrophic forgetting and exploitable blind spots, while retraining on the entire dataset incurs substantial computational cost. We propose FreeMOCA, a memory- and compute-efficient continual learning framework for malicious code analysis that preserves prior knowledge via adaptive layer-wise interpolation between consecutive task updates, leveraging the fact that warm-started task optima are connected by low-loss paths in parameter space. We evaluate FreeMOCA in both class-incremental (Class-IL) and domain-incremental (Domain-IL) settings on large-scale Windows (EMBER) and Android (AZ) malware benchmarks. FreeMOCA achieves substantial gains in Class-IL, outperforming 11 baselines on both EMBER and AZ benchmarks. It also significantly reduces forgetting, achieving the best retention across baselines, and improving accuracy by up to 42% and 37% on EMBER and AZ, respectively. These results demonstrate that warm-started interpolation in parameter space provides a scalable and effective alternative to replay for continual malware detection. Code is available at: https://github.com/IQSeC-Lab/FreeMOCA.
comment: 17 pages, 5 figures, 12 tables
♻ ☆ Revisiting Model Inversion Evaluation: From Misleading Standards to Reliable Privacy Assessment CVPR
Model Inversion attacks aim to reconstruct information from private training data by exploiting access to a target model. Nearly all recent MI studies evaluate attack success using a standard framework that computes attack accuracy through a secondary evaluation model trained on the same private data and task design as the target model. In this paper, we present the first in-depth analysis of this dominant evaluation framework and reveal a fundamental issue: many reconstructions deemed successful under the existing framework are in fact false positives that do not capture the visual identity of the target individual. We first show that these MI false positives satisfy the same formal conditions as Type I adversarial examples. Our controlled experiments, we demonstrate extremely high false-positive transferability, an empirical signature characteristic of adversarial behavior, indicating that many MI false positives likely contain Type I adversarial features. This adversarial transferability significantly inflates reported attack accuracy and leads to an overstatement of privacy leakage in existing MI work. To address this issue, as our second contribution, we introduce a new evaluation framework based on MLLMs, whose general-purpose visual reasoning avoids the shared-task vulnerability and reduces Type-I adversarial transferability of current evaluation framework. We propose systematic design principles for MLLM-based evaluation. Using this framework, we reassess 27 MI attack setups across diverse datasets, target models, and priors, and find consistently high false-positive rates under the conventional approach. Our results call for a reevaluation of progress in MI research and establish MLLM-based evaluation as a more reliable standard for assessing privacy risks in machine learning systems. Code/data/prompt are available at https://hosytuyen.github.io/projects/FMLLM
comment: Accepted to CVPR Findings 2026
♻ ☆ SEDGE: Structural Extrapolated Data Generation
This paper aims to address the challenge of data generation beyond the training data and proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data-generating process. We provide conditions under which data satisfying novel specifications can be generated reliably, together with the approximate identifiability of the distribution of such data under certain ``conservative" assumptions, as well as the inherent non-identifiability of this distribution without such assumptions. On the algorithmic side, we develop practical methods to achieve extrapolated data generation, based on a structure-informed optimization strategy or diffusion posterior sampling, respectively. We verify the extrapolation performance on synthetic data and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.
♻ ☆ Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}^{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
comment: 24 pages, 24 figures
♻ ☆ Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models ICML 2026
Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance-defined over token-embedding geometry-to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.
comment: 20 pages, 3 figures, 8 tables, ICML 2026
♻ ☆ Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming
Adaptive video streaming optimizes Quality of Experience (QoE) metrics by selecting appropriate bitrates according to varying network bandwidth and user demands. In practice, however, real-world network bandwidth often exhibits heterogeneity relative to training environments. Current methods predominantly tackle this problem through learning-based approaches designed to improve generalization performance. While our systematic investigation reveals a critical limitation: neural networks suffer from plasticity loss, significantly impeding their ability to adapt to heterogeneous network conditions. Through theoretical analysis of neural propagation mechanisms, we demonstrate that existing dormant neuron metrics inadequately characterize neural plasticity loss. To address this limitation, we have developed the Silent Neuron theory, which provides a more comprehensive framework for understanding plasticity degradation. Based on these theoretical insights, we propose the Reset Silent Neuron (ReSiN), which preserves neural plasticity through strategic neuron resets guided by both forward and backward propagation states. Moreover, we establish a tighter performance bound for ReSiN under non-stationary network conditions. In our implementation of an adaptive video streaming system, ReSiN has shown significant improvements over existing solutions, achieving up to 168% higher bitrate and 108% better quality of experience (QoE) while maintaining comparable smoothness. Furthermore, ReSiN consistently outperforms in stationary environments, demonstrating its robust adaptability across different network conditions.
♻ ☆ It Takes Two: Your GRPO Is Secretly DPO
GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating the \emph{value baselines} from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains $97.6\%$ of the performance of 16-GRPO, while requiring only $12.5\%$ of the rollouts and $21\%$ of the training time.
♻ ☆ Quantifying and Mitigating Self-Preference Bias of LLM Judges
LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard construction, quality control, and so on. However, the scalability and trustworthiness of this approach can be substantially distorted by Self-Preference Bias (SPB), which is a directional evaluative deviation in which LLMs systematically favor or disfavor their own generated outputs during evaluation. Existing measurements rely on costly human annotations and conflate generative capability with evaluative stance, and thus are impractical for large-scale deployment in real-world systems. To address this issue, we introduce a fully automated framework to quantifying and mitigating SPB, which constructs equal-quality pairs of responses with negligible quality differences, enabling statistical disentanglement of discriminability from bias propensity without human gold standards. Empirical analysis across 20 mainstream LLMs reveals that advanced capabilities are often uncorrelated, or even negatively correlated, with low SPB. To mitigate this bias, we propose a structured multi-dimensional evaluation strategy grounded in cognitive load decomposition, which reduces SPB by 31.5\% on average.
♻ ☆ Strategic PAC Learnability via Geometric Definability
Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.
♻ ☆ TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering
Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at https://github.com/Tianqi-Zhao/TabClustPFN.
♻ ☆ UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration
Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. Therefore we introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with FFT-Laplace Transform and TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both forecasting accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.
comment: The authors wish to withdraw this preprint due to a lack of consensus regarding the final authorship list and the order of authors
♻ ☆ Krause Synchronization Transformers ICML 2026
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Empirically, we validate Krause Attention across diverse settings, including vision (ViT on CIFAR/ImageNet), autoregressive image generation (MNIST/CIFAR-10), large language models (Llama/Qwen), and language models trained from scratch at multiple scales (100M/200M). Across these domains, Krause Attention achieves consistent performance gains while improving computational efficiency, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
comment: ICML 2026, Project page: https://jingkun-liu.github.io/krause-sync-transformers/
♻ ☆ Hierarchical Transformer Preconditioning for Interactive Physics Simulation
Neural preconditioners for real-time physics simulation offer promising data-driven priors, but they often fail to capture long-range couplings efficiently because they inherit local message passing or sparse-operator access patterns. We introduce the Hierarchical Transformer Preconditioner, a neural preconditioner anchored to a weak-admissibility H-matrix partition. The partition provides a multiscale structural prior (dense diagonal leaves plus coarsening off-diagonal tiles) that enables full-graph approximate-inverse computation with O(N) scaling at fixed block sizes. The network models the inverse through low-rank far-field factors and uses highway connections (axial buffers plus a global summary token) to propagate context across transformer depth. At each PCG iteration, preconditioner application reduces to batched dense GEMMs with regular memory access. The key training contribution is a cosine-Hutchinson probe objective that learns the action of MA on convergence-critical spectral subspaces, optimizing angular alignment of MAz with z rather than forcing eigenvalue clusters to a prescribed location. This removes unnecessary spectral-placement constraints from SAI-style objectives and improves conditioning on irregular spectra. Because both inference and apply are dense, dependency-free tensor programs, the full solve loop is captured as a single CUDA Graph. On stiff multiphase Poisson systems (up to 100:1 density contrast, N = 1,024-16,384), the solver runs from ~143 to ~21 fps. At N = 8,192, it reaches 17.9 ms/frame, with 2.2x speedup over GPU Jacobi, ~28x over GPU IC/DILU (AMGX multicolor_dilu), and 2.7x over neural SPAI retrained per scale on the same benchmark.
comment: 10 pages, 7 figures. Includes supplementary video and material
♻ ☆ Non-Stationary Online Structured Prediction with Surrogate Losses
Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret -- the cumulative excess of the actual target loss (e.g., the 0-1 loss) over the surrogate loss (e.g., the logistic loss) incurred by the best fixed estimator -- has gained attention because it admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur surrogate loss that grows linearly with $T$. To address this limitation, we obtain an upper bound of $F_T + O(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence and $P_T$ is its path length. This bound depends on $T$ only through $F_T$ and $P_T$, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel--Young loss. Finally, a lower bound shows that the dependence on $F_T$ and $P_T$ is tight.
♻ ☆ NEST: Nested Event Stream Transformer for Sequences of Multisets
Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.
comment: 10-page main text
♻ ☆ WriteSAE: Sparse Autoencoders for Recurrent State
We introduce WriteSAE, the first sparse autoencoder that decomposes and edits the matrix cache write of state-space and hybrid recurrent language models, where residual SAEs cannot reach. Existing SAEs read residual streams, but Gated DeltaNet, Mamba-2, and RWKV-7 write to a $d_k \times d_v$ cache through rank-1 updates $k_t v_t^\top$ that no vector atom can replace. WriteSAE factors each decoder atom into the native write shape, exposes a closed form for the per-token logit shift, and trains under matched Frobenius norm so atoms swap one cache slot at a time. Atom substitution beats matched-norm ablation on 92.4% of $n=4{,}851$ firings at Qwen3.5-0.8B L9 H4, the 87-atom population test holds at 89.8%, the closed form predicts measured effects at $R^2=0.98$, and Mamba-2-370M substitutes at 88.1% over 2,500 firings. Sustained three-position installs at $3\times$ lift midrank target-in-continuation from 33.3% to 100% under greedy decoding, the first behavioral install at the matrix-recurrent write site.
comment: 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae
♻ ☆ Robust Inference-Time Steering of Protein Diffusion Models via Embedding Optimization
A core challenge in structural biophysics is generating biomolecular conformations that are both physically plausible and consistent with experimental measurements. While sequence-to-structure diffusion models provide powerful priors, posterior sampling methods steer generation by perturbing atomic coordinates with gradients from experimental likelihoods. However, when the target lies in a low-density region of the prior, these methods require aggressive upweighting of the likelihood that can destabilize sampling and be sensitive to hyperparameters. We propose EmbedOpt, an inference-time steering framework that introduces an orthogonal optimization axis: rather than performing posterior sampling under a fixed prior, EmbedOpt directly optimizes the prior by updating the model's conditional embedding. This embedding space encodes rich coevolutionary signals, so optimizing it shifts the structural prior to align with experimental constraints. Empirically, EmbedOpt matches coordinate-based posterior sampling baselines on sparse distance constraints and outperforms them on cryo-electron microscopy map fitting, including real, noisy experimental ones. Furthermore, EmbedOpt's smooth optimization behavior yields robustness to hyperparameters spanning two orders of magnitude and enables comparable performance with fewer diffusion steps. Code is available at https://github.com/rs-station/embedopt.
♻ ☆ Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
♻ ☆ Breaking the Reasoning Horizon in Entity Alignment Foundation Models
Entity alignment (EA) is critical for knowledge graph (KG) fusion. Existing EA models lack transferability and are incapable of aligning unseen KGs without retraining. While using graph foundation models (GFMs) offer a solution, we find that directly adapting GFMs to EA remains largely ineffective. This stems from a critical "reasoning horizon gap": unlike link prediction in GFMs, EA necessitates capturing long-range dependencies across sparse and heterogeneous KG structuresTo address this challenge, we propose a EA foundation model driven by a parallel encoding strategy. We utilize seed EA pairs as local anchors to guide the information flow, initializing and encoding two parallel streams simultaneously. This facilitates anchor-conditioned message passing and significantly shortens the inference trajectory by leveraging local structural proximity instead of global search. Additionally, we incorporate a merged relation graph to model global dependencies and a learnable interaction module for precise matching. Extensive experiments verify the effectiveness of our framework, highlighting its strong generalizability to unseen KGs.
♻ ☆ Causal Multi-Task Demand Learning
We study a canonical multi-task demand-learning problem motivated by retail pricing, where a firm seeks to estimate heterogeneous linear price-response functions across multiple decision contexts. Each context is described by rich covariates but exhibits limited price variation, motivating transfer learning across tasks. A central challenge in leveraging cross-task transfer is endogeneity: prices may be arbitrarily correlated with unobserved task-level demand determinants across tasks. We propose a new meta-learning framework that identifies the conditional mean of task-specific causal demand parameters given a subset of task-specific observables despite such confounding, assuming that each task contains at least two distinct locally exogenous price points. This subset is carefully designed to include all of the prices to address cross-task confounding, while masking two demand outcomes that provide randomized supervision to address identifiability issues arising from the inclusion of all prices. We show that this information design is maximally uniformly valid, in that any refinement of the conditioning set that reveals withheld-outcome information is not guaranteed to identify the conditional mean causal target. We validate our method on real and synthetic data, demonstrating improved recovery of demand responses relative to standard transfer-learning baselines.
♻ ☆ A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection
Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.
♻ ☆ A New Technique for AI Explainability using Feature Association Map
Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust on the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we are proposing a new algorithm for Explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed as Feature Association Map (FAM). The foundation of the modelling is based on association between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms - Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This definitely shows that FAMeX is a promising algorithm in explaining the predictions from an AI system
♻ ☆ What Information Matters? Graph Out-of-Distribution Detection via Tri-Component Information Decomposition ICML26
Graph neural networks are widely used for node classification, but they remain vulnerable to out-of-distribution (OOD) shifts in node features and graph structure. Prior work established that methods trained with standard supervised learning (SL) objectives tend to capture spurious signals from either features and/or structure, leaving the model fragile under distributional changes. To address this, we propose TIDE, a novel and effective Tri-Component Information Decomposition framework that explicitly decomposes information into feature-specific, structure-specific and joint components. TIDE aims to preserve only the label-relevant part of the joint information while filtering out spurious feature- and structure-specific information, thereby enhancing the separation between in-distribution (ID) and OOD nodes. Beyond the framework, we provide theoretical and empirical analyses showing that an information bottleneck objective is preferable to standard SL for graph OOD detection, with higher ID confidence and a greater entropy gap between ID and OOD data. Extensive experiments across seven datasets confirm the efficacy of TIDE, achieving up to a 34% improvement in FPR95 over strong baselines while maintaining competitive ID accuracy.
comment: ICML26
♻ ☆ Learning Polyhedral Conformal Sets for Robust Optimization
Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its performance critically depends on the choice of the uncertainty set. While large sets ensure reliability, they often lead to overly conservative decisions, whereas small sets risk excluding the true outcome. Recent data-driven approaches, particularly conformal prediction, offer finite-sample validity guarantees but remain largely task-agnostic, ignoring the downstream decision structure. In this paper, we propose a decision-aware conformal framework that learns uncertainty sets tailored to robust optimization objectives. Our approach parameterizes a flexible family of polyhedral sets via data-driven hyperplanes and learns their geometry by directly minimizing the induced robust loss, while preserving statistical validity through conformal calibration. To correct for data-dependent selection, we incorporate a re-calibration step on an independent dataset to restore coverage. The resulting sets capture directional and anisotropic uncertainty aligned with the decision objective while remaining computationally tractable. We provide finite-sample coverage guarantees and bounds on the sub-optimality gap to an oracle decision. This work bridges the gap between statistical validity and decision optimality, providing a principled framework for data-driven robust optimization.
♻ ☆ Query-Conditioned Test-Time Self-Training for Large Language Models
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem--solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs. Code is available at https://chssong.github.io/Query-Conditioned-TTST/.
comment: 17 pages, 7 figures
Multimedia 9
☆ SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
☆ Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval
This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
comment: 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics
☆ VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting
Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.
comment: 5 pages, 2 figures
☆ PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media
Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench are publicly available at: https://github.com/xiaomi-research/prove/.
comment: Project Page: https://xiaomi-research.github.io/prove/
☆ Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification ICMR 2026
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.
comment: ACM ICMR 2026 Grand Challenge on Multimedia Verification
☆ Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppress unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
♻ ☆ Content-Adaptive Rate-Quality Curve Prediction Model in Media Processing System IEEE
In streaming media services, video transcoding is a common practice to alleviate bandwidth demands. Unfortunately, traditional methods employing a uniform rate factor (RF) across all videos often result in significant inefficiencies. Content-adaptive encoding (CAE) techniques address this by dynamically adjusting encoding parameters based on video content characteristics. However, existing CAE methods are often tightly coupled with specific encoding strategies, leading to inflexibility. In this paper, we propose a model that predicts both RF-quality and RF-bitrate curves, which can be utilized to derive a comprehensive bitrate-quality curve. This approach facilitates flexible adjustments to the encoding strategy without necessitating model retraining. The model leverages codec features, content features, and anchor features to predict the bitrate-quality curve accurately. Additionally, we introduce an anchor suspension method to enhance prediction accuracy. Experiments confirm that the actual quality metric (VMAF) of the compressed video stays within 1 of the target, achieving an accuracy of 99.14%. By incorporating our quality improvement strategy with the rate-quality curve prediction model, we conducted online A/B tests, obtaining both +0.107% improvements in video views and video completions and +0.064% app duration time. Our model has been deployed on the Xiaohongshu App.
comment: Accepted by IEEE VCIP 2024 (Oral)
♻ ☆ JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 65.3\%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
♻ ☆ StreamGuard: Exploring a 5G Architecture for Efficient, Quality of Experience-Aware Video Conferencing
Video conferencing over 5G is increasingly prevalent, yet its Quality of Experience (QoE) often degrades under limited radio resources. This has two causes: 5G networks must serve many users, while interactive traffic requires careful handling. Motivated by the insight that different subflows within an interactive session have a disproportionate effect on QoE, we present the design and implementation of StreamGuard, a practical 5G architecture for subflow-level, QoE-aware prioritization. StreamGuard forms a closed control loop with three components: (1) a monitor in the Radio Access Network (RAN) that uses deep packet inspection to infer QoE and RAN state, (2) a controller that selects prioritization actions to balance QoE and fairness, and (3) a marking module that applies these decisions by marking packets to steer subflows into appropriate priority queues. StreamGuard further shapes application behaviors via mechanisms including selective subflow dropping and probe-based rate control, to align application behavior with radio constraints. Implemented in a real 5G testbed, StreamGuard achieves a superior QoE-fairness tradeoff compared to vanilla 5G and prior state-of-the-art approaches, improving QoE by up to 70% at comparable background throughput or preserving up to 2x higher background throughput at similar QoE.
comment: 31 pages, 35 figures
Computer Vision and Pattern Recognition 80
☆ MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving NeurIPS 2026
Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent -specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real-world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.
comment: 19 pages, 9 figures, NeurIPS 2026 submission
☆ CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers CVPR
Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory head-room, enabling higher-resolution generation.
comment: 8 pages, 8 figures, CVPR workshop
☆ You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps
Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net based-architecture designed to reconstructs $128{ \times }128$ facial images from severely degraded $16{ \times }16$ inputs, achieving an $8 \times $ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
☆ Rethinking the Good Enough Embedding for Easy Few-Shot Learning
The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a "Good Embedding All You Need?" In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify an optimal feature extraction. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
☆ TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion CVPR'26
Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.
comment: CVPR'26 Workshop on Agentic AI for Visual Media
☆ PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.
☆ Keyed Nonlinear Transform: Lightweight Privacy-Enhancing Feature Sharing for Medical Image Analysis
Feature sharing via split inference offers a lightweight alternative to federated learning for resource-constrained hospitals, but transmitted features still leak patient identity information and lack practical mechanisms for controlled feature sharing. We propose Keyed Nonlinear Transform (KNT), a drop-in feature transformation that applies key-conditioned obfuscation to intermediate representations. KNT reduces re-identification AUC from 0.635 to 0.586, corresponding to a 36% reduction in above-chance identity signal, while introducing only 0.15 ms CPU overhead, without backbone retraining, and preserving classification performance within 1.0 pp. Our analysis shows that KNT's nonlinear transform prevents closed-form inversion and shifts recovery to iterative gradient-based optimization under full key compromise, substantially increasing inversion difficulty. The same transform generalizes to dense prediction tasks, incurring only a 4.4 pp Dice reduction on skin-lesion segmentation without retraining. These results position KNT as a practical and efficient privacy layer for split inference deployments.
☆ ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2\% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2\%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8\%.
comment: CVR 2026
☆ SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection CVPR 2026
Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.
comment: Accepted to CVPR 2026
☆ Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening
Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFoundDINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
☆ DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction
Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET
☆ Venus-DeFakerOne: Unified Fake Image Detection & Localization
In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifacts transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifacts transfer-interference patterns, the necessity of fine-grained supervision, and the original resolution artifacts preservation, highlighting the design principles for scalable, robust, and unified FIDL.
☆ CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of \textbf{756 images} of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only \textbf{71.1\%} tree-generation accuracy on CurveBench-Easy and \textbf{19.1\%} on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over \texttt{Qwen-3-VL-8B-Thinking} from \textbf{2.8\%} to \textbf{33.3\%} tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
☆ Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning ICML 2026
Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
comment: Accepted by ICML 2026 as Spotlight
☆ Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation
Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit to all layers' behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing $91.6\%$ of the variance ($R^2$) compared to only $70.2\%$ for homogeneous baselines, allowing our modified architecture to recover $84.25\%$ Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.
comment: 18 pages, 7 figures
☆ PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow
Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision--language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at https://github.com/dongw22/PVRF.
comment: 10 pages, 9 figures, and 4 tables
☆ Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study CVPR 2026
Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.
comment: Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures
☆ Unified Pix Token And Word Token Generative Language Model
Since the emergence of Vision Transformer (ViT), it has been widely used in generative language model and generative visual model. Especially in the current state-of-art open source multimodal models, ViT obtained by CLIP or SigLIP method serves as the vision encoder backbone to help them acquire visual understanding capabilities. But this method leads to limitations in visual understanding for details, such as difficulty in recognizing small text or numbers in images. To address these issues, we propose a new model to unify pix token and word token into the generative language model. The new model also features with each pix of image having its own token embedding, color folding, global conditional attention approximation and image unsupervised pretraining. We conducted image unsupervised pretraining experiments using our new model to explore its potential. The experimental results show that it has good performance even in small model and with limited training data. We believe our model also conforms to the scaling law, as long as model parameters and training data increased, its performance will continue to improve.
comment: 13 pages, 6 figures
☆ CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI
Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.
☆ Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers
Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another, shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive activations transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
comment: Project page: https://aimagelab.github.io/MAs-DiT/
☆ Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
☆ QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, globally update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.
☆ Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
comment: work in progress
☆ History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.
comment: 12 pages, 3 figures
☆ OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.
comment: Preprint; 12 pages, 7 figures, 10 tables
☆ JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift
Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is made publicly available at github repository https://github.com/lavsendahal/janus and model weights are at https://huggingface.co/lavsendahal/janus.
☆ EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query--moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. Through this self-reinforcing reinforcement-learning loop, the two agents are initialized from the same backbone and mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.
comment: Project page: https://minjoong507.github.io/projects/EvoGround/
☆ VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence
Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit--transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR--CT and inter-subject HCP T2w--T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at \href{https://github.com/guneytombak/VoxCor}{guneytombak/VoxCor}.
☆ BlitzGS: City-Scale Gaussian Splatting at Lightning Speed
We present BlitzGS, a distributed 3DGS framework that reduces active Gaussian workload for fast city-scale reconstruction. BlitzGS manages this workload at three coupled levels. At the system level, the framework shards Gaussians across GPUs by index parity rather than spatial blocks. This approach mitigates the cross-block visibility redundancy inherent in spatial partitioning. Furthermore, it distributes each rendering step through a single cross-GPU exchange that routes projected Gaussians to their tile owners. At the model level, scheduled importance-scoring passes shrink the global Gaussian population. During these passes, the framework generates a per-Gaussian visibility weight to bias density-control updates toward contributing primitives and a per-view importance mask for the view-level renderer. At the view level, BlitzGS trims each camera's active set with a distance-based LOD gate to exclude excessively fine primitives for the current frustum and the importance-based culling mask to skip Gaussians with negligible cross-view contribution. On large-scale benchmarks, BlitzGS matches the rendering quality of recent large-scale baselines while delivering an order-of-magnitude speedup, training city-scale scenes in tens of minutes. Our code is available at https: //github.com/AkierRaee/BlitzGS.
☆ Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
☆ RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds--a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.
comment: On-going work
☆ Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception CVPR 2026
In recent years, autonomous driving has significantly in creased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real world datasets struggle to meet this demand as require ments continuously evolve and large-scale annotated data collection remains costly and time-consuming making syn thetic data a scalable, practical and controllable alterna tive. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This ap proach enables scalable appearance-level asset diversifica tion without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary ex periments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversifica tion improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI en ables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.
comment: Published at SAIAD 2026 Workshop at CVPR 2026
☆ Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein
We propose min Generalized Sliced Gromov--Wasserstein (min-GSGW), a sliced formulation for the Gromov--Wasserstein (GW) problem using expressive generalized slicers. The key idea is to learn coupled nonlinear slicers that assign compatible push-forward values to both input measures, so that monotone coupling in the projected domain lifts to a transport plan evaluated against the GW objective in the original spaces. The resulting plan induces a GW objective value, and min-GSGW minimizes this cost directly in the original spaces. We further show that min-GSGW is rigid-motion invariant, a crucial property for geometric matching and shape analysis tasks. Our contributions are threefold: 1) we introduce generalized slicers into the sliced GW framework, 2) we construct a slicing-based efficient GW transport plan; and 3) we develop an amortized variant that replaces per-instance optimization with a learned slicer for unseen input pairs. We perform experiments on animal mesh matching, horse mesh interpolation, and ShapeNet part transfer. Results show that min-GSGW produces meaningful geometric correspondences and GW objective values at substantially lower computational cost than existing GW solvers.
☆ Weakly-Supervised Spatiotemporal Anomaly Detection
In this paper, we explore a weakly supervised method for anomaly detection. Since annotating videos is time-consuming, we only look at weak video-level labels during training. This means that given a video, we know that it is either normal or contains an anomaly, but no further annotations are used to train the network. Features are extracted from video clips that are either normal or anomalous. These features are used to determine anomaly scores for spatiotemporal regions of the clips based on a classifier and the implementation of a multiple instance ranking loss (MIL). We represent both anomalous and normal video clips as positive and negative bags, respectively, to apply MIL. Furthermore, since anomalies are usually localized to a part of a frame rather than the whole frame, we chose to explore temporal as well as spatial anomaly detection. We show our results on the UCF Crime2Local Dataset, which contains spatiotemporal annotations for a portion of the UCF Crime Dataset.
☆ Aligning Network Equivariance with Data Symmetry: A Theoretical Framework and Adaptive Approach for Image Restoration
Image restoration is an inherently ill posed inverse problem. Equivariant networks that embed geometric symmetry priors can mitigate this ill posedness and improve performance. However, current understanding of the relationship between network equivariance and data symmetry remains largely heuristic. Particularly for real world data with imperfect symmetry, existing research lacks a systematic theoretical framework to quantify symmetry, select transformation groups, or evaluate model data alignment. To bridge this gap, we conduct an analysis from an optimization perspective and formalize the intrinsic relationship among data symmetry priors, model equivariance, and generalization capability. Specifically, we propose for the first time a quantifiable definition of non strict symmetry at the dataset level (rather than sample level) and use it as a constraint to formulate the restoration inverse problem. We then show that the equivariance for restoration models can be naturally derived from this inverse problems incorporated the proposed symmetry constraints, and that the equivariance error of the optimal restoration operator is strictly bounded by the data symmetry error and the discretization mesh size. Furthermore, by analyzing the network's empirical risk, we demonstrate that aligning equivariance with data symmetry optimizes the bias variance trade off, minimizing the total expected risk. Guided by these insights, we propose a Sample Adaptive Equivariant Network that uses a hypernetwork and transformation learnable equivariant convolutions to dynamically align with each sample's inherent symmetry. Extensive experiments on super resolution, denoising, and deraining validate our theoretical findings and show significant superiority over standard baselines and traditional equivariant models. Our code and supplementary material are available at https://github.com/tanfy929/SA-Conv.
comment: 30 pages, 9 figures, Supplementary Material can be found at https://github.com/tanfy929/SA-Conv
☆ LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D and self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graphs baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as, competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.
☆ Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography
Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.
☆ Coordinating Multiple Conditions for Trajectory-Controlled Human Motion Generation
Trajectory-controlled human motion generation aims to synthesize realistic human motions conditioned on both textual descriptions and spatial trajectories. However, existing methods suffer from two critical limitations: first, the conflict between text and trajectory conditions disrupts the denoising process, resulting in compromised motion quality or inaccurate trajectory following; second, the use of redundant motion representations introduces inconsistencies between motion components, leading to instability during trajectory control. To address these challenges, we propose CMC, a decoupled framework that effectively coordinates text and trajectory conditions through a divide-and-conquer strategy. CMC follows a divide-and-conquer paradigm, comprising two cascaded stages: Trajectory Control and Motion Completion. In the first stage, a diffusion model generates a simplified representation of the controlled joints under trajectory guidance, based on the given trajectories, ensuring accurate and stable trajectory following. In the second stage, a text-conditioned diffusion inpainting model generates full-body motions using the simplified representation from the first stage as partial observations. To mitigate overfitting caused by limited inpainting training data, we further introduce the Selective Inpainting Mechanism (SIM), which alternates between text-to-motion generation and motion inpainting tasks during training. Experiments on HumanML3D and KIT datasets demonstrate that CMC achieves state-of-the-art performance in control accuracy and motion quality, demonstrating its effectiveness in coordinating multimodal conditions and representations.
☆ AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow achieves performance matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
comment: Project page at https://nvlabs.github.io/AnyFlow/
☆ Learning to Optimize Radiotherapy Plans via Fluence Maps Diffusion Model Generation and LSTM-based Optimization MICCAI 2026
Volumetric Modulated Arc Therapy (VMAT) is a cornerstone of modern radiation therapy, enabling highly conformal tumor irradiation and healthy-tissue sparing. Yet, its planning solves inverse and nested optimization for multi-leaf collimators, monitor units and dose parameters, while enforcing their consistency to ensure mechanical deliverability. Nevertheless, this process often requires repeated re-optimization when treatment configurations change, resulting in substantial planning time per patient. To address these problems, we present a diffusion-driven Learning-to-Optimize (L2O) method for end-to-end VMAT planning. A distribution-matching distilled diffusion model learns a clinically feasible manifold of fluence maps, enabling their one-shot generation. On top of this, an LSTM-based L2O module learns gradient update dynamics to swiftly refine fluence maps toward prescribed dose objectives during inference. Experimental results on clinical and public prostate cancer cohorts demonstrate improved planning efficiency, flexibility, and machine deliverability over currently available end-to-end VMAT planners.
comment: Early Accept at MICCAI 2026
☆ MedCore: Boundary-Preserving Medical Core Pruning for MedSAM
Medical segmentation foundation models such as SAM and MedSAM provide strong prompt-driven segmentation, but their image encoders are still too large for many clinical settings. Compression is also risky in medicine because a model can keep high Dice while losing boundary fidelity. We propose MedCore, a structured pruning framework for MedSAM. The main idea is to preserve two kinds of structures: structures that became important during SAM-to-MedSAM adaptation, and structures that have high boundary leverage. We identify the first type by a dual-intervention score that compares zeroing a group with resetting it to its original SAM weight. We identify the second type by boundary-aware Fisher estimation. We also introduce a boundary leverage principle, which shows that compression-induced boundary displacement is controlled by logit perturbation on the boundary divided by the logit spatial gradient. This principle explains why boundary metrics can degrade even when Dice remains high. On polyp segmentation benchmarks, MedCore reduces parameters by 60.0% and FLOPs by 58.4% while achieving Dice 0.9549, Boundary F1 0.6388, and HD95 5.14 after recovery fine-tuning. It also reaches 86.6% parameter reduction and 90.4G FLOPs with strong boundary quality. Our analysis further shows that MedSAM lies in a head-fragile boundary regime: head-pruning steps have 2.887 times larger 95th-percentile boundary leverage than MLP-pruning steps, and this logit-level effect is consistent with BF1 and HD95 degradation. Our code is available at https://github.com/cenweizhang/MedCore.
comment: 3 figures, 17 pages
☆ Cross Modality Image Translation In Medical Imaging Using Generative Frameworks
Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.
☆ Characterizing Universal Object Representations Across Vision Models
Deep neural networks trained with different architectures, objectives, and datasets have been reported to converge on similar visual representations. However, what remains unknown is which visual properties models actually converge on and which factors may underlie this convergence. To address this, we decompose the object similarity structure of 162 diverse vision models into a small set of non-negative dimensions. To determine universal versus model-specific dimensions, we then estimate how often each dimension reappears across models. In contrast to model-specific dimensions, universal dimensions are more interpretable and more strongly driven by conceptual image properties, indicating the relevance of interpretability and semantic content as implicit factors driving universality across models. Differences in architecture, objective function, training data, model size, and model performance do not explain the emergence of universal dimensions. However, models with more universal dimensions also better predict macaque IT activity and human similarity judgments, suggesting that universality reflects representations relevant to biological vision. These findings have important implications for understanding the emergent representations underlying deep neural network models and their alignment with biological vision.
☆ Weakly Supervised Segmentation as Semantic-Based Regularization
Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.
☆ SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.
☆ Pattern-Enhanced RT-DETR for Multi-Class Battery Detection
Accurate and efficient battery detection is increasingly important for applications in electronic waste recycling, industrial quality control, and automated sorting systems. In this paper, we present both a comprehensive benchmark and a novel method for multi-class battery detection. We systematically compare three CNN-based detectors (YOLOv8n, YOLOv8s, YOLO11n) and two transformer-based detectors (RT-DETR-L, RT-DETR-X) on a publicly available dataset of approximately 8,591 annotated images under identical experimental conditions, and further propose PaQ-RT-DETR, which introduces pattern-based dynamic query generation into RT-DETR to alleviate query activation imbalance with negligible computational overhead. Among baselines, YOLO11n achieves the best CNN-based accuracy (mAP@50: 0.779) at only 2.6M parameters, while YOLOv8n delivers the fastest inference at ~1,667 FPS. PaQ-RT-DETR-X achieves the highest overall mAP@50 of 0.782, surpassing RT-DETR-X by +2.8% with consistent per-class gains across all six battery categories including the data-scarce Bike Battery class. Our findings provide practical guidance for selecting object detection models in battery-related industrial applications.
comment: 4 pages, 3 figures
☆ SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
☆ HADAR-Based Thermal Infrared Hyperspectral Image Restoration
Thermal-infrared (TIR) hyperspectral imagery (HSI) provides critical scene information for various applications. However, its practical utility is severely limited by unique sensor degradations beyond the capabilities of existing restoration methods, which are ignorant of underlying thermal physics. Here, we propose HAIR (HADAR-based Image Restoration) as a physics-driven framework for ground-based TIR-HSI restoration. HAIR utilizes the HADAR rendering equation (HRE) and combines it with the atmospheric downwelling radiative transfer equation (RTE) to model TIR-HSI using temperature, emissivity, and texture (TeX) physical triplets. This physical model leads to a TeX decompose-synthesize strategy that guarantees physical consistency and spatio-spectral noise resilience, in stark contrast to existing approaches. Moreover, our framework uses a forward-modeled atmospheric downwelling reference, along with spectral smoothness of emissivity and blackbody radiation, to enable spectral calibration and generation that would otherwise be elusive. Our extensive experiments on the outdoor DARPA Invisible Headlights dataset and in-lab FTIR measurements show that HAIR consistently outperforms state-of-the-art methods across denoising, inpainting, spectral calibration, and spectral super-resolution, establishing a benchmark in objective accuracy and visual quality.
comment: 17 pages, 18 figures
☆ Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
In this paper, we propose GTA-VLA(Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/
♻ ☆ FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices
Existing LiDAR 3D object detection methods predominantely rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and proposed FALO can readily deploy on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).
♻ ☆ V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
comment: Project page: https://genjib.github.io/v2m_zero/
♻ ☆ Image Generators are Generalist Vision Learners
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
comment: Project Page: http://vision-banana.github.io
♻ ☆ Enhancing Few-Shot Classification of Benchmark and Disaster Imagery with ABHFA-Net
The rising incidence of natural and human-induced disasters necessitates robust visual recognition systems capable of operating under limited labeled data conditions. However, disaster-related image classification remains challenging due to data scarcity, high intra-class variability, and domain-specific complexities in remote sensing imagery. To address these challenges, we propose the Attention Bhattacharyya Distance-based Feature Aggregation Network (ABHFA-Net), a novel few-shot learning (FSL) framework that models class prototypes as probability distributions and performs classification via Bhattacharyya distance-based comparison. Our approach integrates a spatial channel attention mechanism to enhance discrimiantive feature learning in the few-shot context and introduces a Bhattacharyya-based contrastive softmax loss for improved class separability. Extensive experiments on both benchmark datasets (CIFAR-FS, FC-100, miniImageNet, tieredImageNet) and real-world disaster datasets (AIDER, CDD, MEDIC) demonstrate the effectiveness of the proposed method. In particular, ABHFA-Net achieves 80.7% and 92.3% accuracy on CIFAR-FS under 5-way 1-shot and 5-shot settings, respectively, outperforming existing state-of-the-art methods. On disaster datasets, the model consistently improves classification performance, achieving up to 68.2% (1-shot) and 78.3% (5-shot) accuracy on AIDER, highlighting its robustness in real-world scenarios. These results establish ABHFA-Net as a strong and practical solution for few-shot disaster image classification, particularly in data-scarce and time-critical remote sensing applications. The code repository for our work is available at https://github.com/GreedYLearner1146/ABHFA-Net.
comment: Revised and Submitted to SN Computer journal
♻ ☆ XTinyU-Net: Training-Free U-Net Scaling via Initialization-Time Sensitivity MICCAI 2026
While U-Net architectures remain the gold standard for medical image segmentation, their deployment in resource-constrained environments demands aggressive model compression. However, finding an optimally efficient configuration is computationally prohibitive, typically requiring exhaustive train-and-evaluate cycles to find the smallest model that maintains peak performance. In this paper, we introduce a training-free selection framework to automatically identify ultralightweight, dataset-specific U-Net configurations directly at initialization. We observe that systematically scaling down U-Net channel width induces a sharp transition from a stable performance plateau to representational capacity collapse. To pinpoint this boundary without training, we propose a Jacobian-based sensitivity metric that scores discrete, width-capped U-Net variants using a small set of unlabeled images. By analyzing the total variation of this sensitivity curve, we isolate the smallest stable configuration, which we denote as XTinyU-Net. Evaluated across six diverse medical datasets within the nnU-Net framework, XTinyU-Net achieves segmentation accuracy comparable to the heavy nnU-Net baseline with 400x-1600x fewer parameters, and outperforms contemporary lightweight architectures while utilizing 5x-72x fewer parameters. Code is publicly accessible on https://github.com/alvinkimbowa/nntinyunet.git.
comment: Early accepted to MICCAI 2026
♻ ☆ Dual Ascent Diffusion for Inverse Problems
Ill-posed inverse problems are fundamental in many domains, ranging from astrophysics to medical imaging. Emerging diffusion models provide a powerful prior for solving these problems. Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. Our framework achieves better image quality as measured by various metrics for image restoration problems, it is more robust to high levels of measurement noise, it is faster, and it estimates solutions that represent the observations more faithfully than the state of the art.
comment: Project page: https://soniaminseokim.github.io/ddiff/
♻ ☆ Iskra: A System for Inverse Geometry Processing
We propose a system for differentiating through solutions to geometry processing problems. Our system differentiates a broad class of geometric algorithms, exploiting existing fast problem-specific schemes common to geometry processing, including local-global and ADMM solvers. It is compatible with machine learning frameworks, opening doors to new classes of inverse geometry processing applications. We marry the scatter-gather approach to mesh processing with tensor-based workflows and rely on the adjoint method applied to user-specified imperative code to generate an efficient backward pass behind the scenes. We demonstrate our approach by differentiating through mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation, examining usability and performance on these applications. Our system allows practitioners to differentiate through existing geometry processing algorithms without needing to reformulate them, resulting in low implementation effort, fast runtimes, and lower memory requirements than differentiable optimization tools not tailored to geometry processing.
♻ ☆ From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes
Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
comment: Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://irvlutd.github.io/L2G/
♻ ☆ Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
comment: Project page: https://cheliu-computation.github.io/omni/
♻ ☆ Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli, essentially images, from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pretrained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision based space or a joint text image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.
♻ ☆ Training-Free Inference for High-Resolution Sinogram Completion
High-resolution sinogram completion is critical for computed tomography reconstruction, as missing projections can introduce severe artifacts. While diffusion models provide strong generative priors for this task, their inference cost grows prohibitively with resolution. We propose HRSino, a training-free and efficient diffusion inference approach for high-resolution sinogram completion. By explicitly accounting for spatial heterogeneity in signal characteristics, such as spectral sparsity and local complexity, HRSino allocates inference effort adaptively across spatial regions and resolutions, rather than applying uniform high-resolution diffusion steps. This enables global consistency to be captured at coarse scales while refining local details only where necessary. Experimental results show that HRSino reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to the state-of-the-art framework, and maintains completion accuracy across datasets and resolutions.
♻ ☆ Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising
Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and LayerNorm. We introduce Wrapped Normalization Equivariance (WNE), a parameter-free wrapper that normalizes the input, applies any backbone, and denormalizes the output. We prove every NE function admits this factorization, so the wrapper exactly parameterizes the class of NE functions. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable GPU overhead, while architectural NE baselines are up to $1.6\times$ slower.
♻ ☆ One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $π_0$).
♻ ☆ MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models ICLR 2026
Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.
comment: Accepted at ICLR 2026 (poster)
♻ ☆ ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry
We present ImmuVis, a family of efficient foundation models for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest dataset to date, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms state-of-the-art baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical framework for real-world IMC modeling.
comment: 38 pages, 19 figures
♻ ☆ Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation
Inverse problems for stiff parabolic partial differential equations (PDEs), such as the inverse heat conduction problem (IHCP), are severely ill-posed: the forward map rapidly damps high-frequency interior structure before it reaches the boundary. Soft-constrained physics-informed neural networks (PINNs), which embed the PDE as a residual penalty, suffer from gradient pathology in this regime and tend to fit boundary measurements while leaving the interior field essentially untouched. We propose Neural Field Thermal Tomography (NeFTY), a hard-constrained neural field framework for label-free three-dimensional inverse heat conduction. NeFTY represents the unknown diffusivity as a continuous coordinate-based neural network, and at every optimization step passes the candidate field through a differentiable implicit-Euler heat solver with harmonic-mean interface flux, so that the governing PDE holds exactly on the discretization rather than as a soft penalty. Adjoint gradients propagate the surface reconstruction error back to the network weights at solver-level memory cost, making test-time inversion tractable on a single GPU. Across synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline on label-free volumetric recovery, and it transfers to real thermography data, surpassing classical signal-processing baselines in both defect segmentation and depth estimation. Additional details at https://cab-lab-princeton.github.io/nefty/
comment: 37 pages, 19 figures
♻ ☆ Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation KDD 2026
We consider analyzing traffic accident patterns using both road network data and satellite images aligned to road graph nodes. Previous work for predicting accident occurrences relies primarily on road network structural features while overlooking physical and environmental information from the road surface and its surroundings. In this work, we construct a large multimodal dataset spanning six U.S. states, containing nine million traffic accident records from official sources, and one million high-resolution satellite images for each node of the road network. Additionally, every node is annotated with features such as the region's weather statistics and road type (e.g., residential vs. motorway), and each edge is annotated with traffic volume information (i.e., Average Annual Daily Traffic). Utilizing this dataset, we conduct a comprehensive evaluation of multimodal learning methods that integrate both visual and network embeddings. Our findings show that integrating both data modalities improves prediction accuracy, achieving an average AUROC of $90.1\%$, a $3.7\%$ gain over graph neural network models that use only graph structures. With the improved embeddings, we conduct a causal analysis using a matching estimator to identify the key factors influencing traffic accidents. We find that accident rates rise by $24\%$ under higher precipitation, by $22\%$ on higher-speed roads such as motorways, and by $29\%$ due to seasonal patterns, after adjusting for other confounding factors. Ablation studies confirm that satellite imagery features are essential for achieving accurate prediction.
comment: 17 pages. Appeared in KDD 2026
♻ ☆ SuperF: Neural Implicit Fields for Multi-Image Super-Resolution ICLR 2026
High-resolution imagery is often hindered by limitations in sensor technology, atmospheric conditions, and costs. Such challenges occur in satellite remote sensing, but also with handheld cameras, such as our smartphones. Hence, super-resolution aims to enhance the image resolution algorithmically. Since single-image super-resolution requires solving an inverse problem, such methods must exploit strong priors, e.g. learned from high-resolution training data, or be constrained by auxiliary data, e.g. by a high-resolution guide from another modality. While qualitatively pleasing, such approaches often lead to "hallucinated" structures that do not match reality. In contrast, multi-image super-resolution (MISR) aims to improve the (optical) resolution by constraining the super-resolution process with multiple views taken with sub-pixel shifts. Here, we propose SuperF, a test-time optimization approach for MISR that leverages coordinate-based neural networks, also called neural fields. Their ability to represent continuous signals with an implicit neural representation (INR) makes them an ideal fit for the MISR task. The key characteristic of our approach is to share an INR for multiple shifted low-resolution frames and to jointly optimize the frame alignment with the INR. Our approach advances related INR baselines, adopted from burst fusion for layer separation, by directly parameterizing the sub-pixel alignment as optimizable affine transformation parameters and by optimizing via a super-sampled coordinate grid that corresponds to the output resolution. Our experiments yield compelling results on simulated bursts of satellite imagery and ground-level images from handheld cameras, with upsampling factors of up to 8. A key advantage of SuperF is that this approach does not rely on any high-resolution training data.
comment: Published at ICLR 2026, Project website: https://sjyhne.github.io/superf/, 23 pages, 13 figures, 8 table
♻ ☆ MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.
comment: 33 pages
♻ ☆ SS3D: End2End Self-Supervised 3D from Web Videos
We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
♻ ☆ Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters
Posing 3D characters is a fundamental task in computer graphics. However, existing paradigms, ranging from traditional auto-rigging to recent pose-conditioned generative models, frequently struggle with inaccurate skinning weights, fixed mesh topologies, and poor pose conformance. These challenges have become particularly pronounced with the recent explosion of AI-generated 3D assets, which often exhibit flawed structures and fused geometry. To address these issues, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a skinning-free latent-space transformation problem. By decoupling shape deformation from the constraints of fixed mesh connectivity, our method directly operates on compact latent representations to reconstruct characters in target poses. To achieve this, our framework integrates a latent posing transformer for shape manipulation, a dense pose representation for fine-grained control, and an adaptive completion module optimized via a bipartite-matched latent loss to robustly handle topological changes. Extensive experiments demonstrate that our method significantly outperforms existing baselines in posing quality. Furthermore, our skeleton-agnostic design exhibits remarkable zero-shot generalization to diverse morphologies including quadrupeds and seamlessly supports various 3D authoring applications such as part replacement and refinement.
comment: Project page: https://jasongzy.github.io/Make-It-Poseable/
♻ ☆ Prototype-Based Test-Time Adaptation of Vision-Language Models
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
♻ ☆ MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
Multimodal large language models (MLLMs) have advanced clinical tasks for common conditions, but their performance on rare diseases remains largely untested. In rare-disease scenarios, clinicians often lack prior clinical knowledge, forcing them to rely strictly on case-level evidence for clinical judgments. Existing benchmarks predominantly evaluate common-condition, single-image settings, leaving multimodal and multi-image evidence integration under rare-disease data scarcity systematically unevaluated. We introduce MMRareBench, to our knowledge the first rare-disease benchmark jointly evaluating multimodal and multi-image clinical capability across four workflow-aligned tracks: diagnosis, treatment planning, cross-image evidence alignment, and examination suggestion. The benchmark comprises 1,756 question-answer pairs with 7,958 associated medical images curated from PMC case reports, with Orphanet-anchored ontology alignment, track-specific leakage control, evidence-grounded annotations, and a two-level evaluation protocol. A systematic evaluation of 23 MLLMs reveals fragmented capability profiles and universally low treatment-planning performance, with medical-domain models trailing general-purpose MLLMs substantially on multi-image tracks despite competitive diagnostic scores. These patterns are consistent with a capacity dilution effect: medical fine-tuning can narrow the diagnostic gap but may erode the compositional multi-image capability that rare-disease evidence integration demands.
♻ ☆ Guidestar-Free Adaptive Optics with Asymmetric Apertures
This work introduces the first closed-loop adaptive optics (AO) system capable of optically correcting aberrations in real-time without a guidestar or a wavefront sensor. Nearly 40 years ago, Cederquist et al. demonstrated that asymmetric apertures enable phase retrieval (PR) algorithms to perform fully computational wavefront sensing, albeit at a high computational cost. More recently, Chimitt et al. extended this approach with machine learning and demonstrated real-time wavefront sensing using only a single (guidestar-based) point-spread-function (PSF) measurement. Inspired by these works, we introduce a guidestar-free AO framework built around asymmetric apertures and machine learning. Our approach combines three key elements: (1) an asymmetric aperture placed at the system's pupil plane that enables PR-based wavefront sensing, (2) a pair of machine learning algorithms that estimate the PSF from natural scene measurements and reconstruct phase aberrations, and (3) a spatial light modulator that performs optical correction. We experimentally validate this framework on dense natural scenes imaged through unknown obscurants. Our method outperforms state-of-the-art guidestar-free wavefront shaping methods, using an order of magnitude fewer measurements and three orders of magnitude less computation.
comment: Accepted to ACM Transactions on Graphics (TOG)
♻ ☆ ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., \textit{on}, \textit{standing on}, \textit{resting on}, and \textit{supported by}. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose \textbf{ReLIC-SGG}, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
comment: Some errors in the experimental sections
♻ ☆ UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
♻ ☆ NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results CVPR2026
This paper presents an overview of the NTIRE 2026 Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images. Building upon the success of the first edition, this challenge attracted a wide range of impressive solutions, all developed and evaluated on our real-world Raindrop Clarity dataset~\cite{jin2024raindrop}. For this edition, we adjust the dataset with 14,139 images for training, 407 images for validation, and 593 images for testing. The primary goal of this challenge is to establish a strong and practical benchmark for the removal of raindrops under various illumination and focus conditions. In total, 168 teams have registered for the competition, and 17 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the Raindrop Clarity dataset, demonstrating the growing progress in this challenging task.
comment: Accepted by CVPR2026 Workshop; NTIRE 2026 Challenge Report
♻ ☆ The Direct Integration Theorem: A Rigorous Framework for Consistent Discrete Solutions of the Inverse Radon Problem IEEE
This paper presents a novel Direct Integration Theorem (DIT), derived as a non-trivial corollary of the classical Central Slice Theorem (CST). The DIT provides a mathematically consistent transition from the continuous to the discrete domain - a fundamental challenge in computed tomography - thereby eliminating the need for frequency-domain interpolation without resorting to conventional ramp-filtering. The proposed approach circumvents two principal limitations inherent in traditional methods: (i) the zero-frequency singularity and spectral distortions introduced by the mandatory ramp-filtering step, and (ii) discretization inaccuracies associated with frequency-domain interpolation. Based on the DIT, we develop a rigorous framework for consistent discrete solutions of the inverse Radon problem. Mathematical modeling demonstrates that this approach achieves quasi-exact reconstruction, with errors constrained solely by sampling parameters and grid geometry. Furthermore, while Filtered Back Projection (FBP) inherently distorts the variance of the reconstructed image, the DIT-based algorithm preserves it. Comparative simulations confirm that the proposed method eliminates common artifacts, such as intensity cupping, and consistently outperforms FBP in terms of PSNR, SSIM, and reprojection fidelity, faithfully restoring the original image's statistical characteristics.
comment: Submitted to IEEE TPAMI. Code and data available at https://github.com/Mozerov-iitp/radon-dit/
♻ ☆ From Baselines to Transport Geodesics: Axiomatic Attribution via Optimal Generative Flows
Feature attributions often hide a critical modeling choice: they explain a prediction along a counterfactual path from a reference state to an input. Different baselines, interpolations, and generative trajectories define different paths and can therefor produce different explanations. We study this path ambiguity as a modeling problem. Our central question is whether the path can be chosen by the data-generating transport process, rather than by a hand-designed interpolation or by the sensitivity geometry of the model being explained. We separate attribution into fixed-path credit allocation and path selection. For a fixed path, we prove that the Aumann-Shapley line integral is the unique attribution rule under standard fixed-path axioms and explicit coordinate-trace regularity. For path selection, we minimize kinetic action over flows that transport a reference distribution to the data distribution, yielding a transport-geodesic attribution principle. We approximate this ideal with Rectified Flow and Reflow and derive stability bounds linking vector-field error to attribution error. Experiments show that lower-action, transport-consistent paths produce more stable and structured explanations, preserving competitive deletion faithfulness, without claiming data-manifold membership. Our code is available at https://github.com/cenweizhang/OTFlowSHAP.
comment: 10 figures, 31 pages
♻ ☆ TAFA-GSGC: Group-wise Scalable Point Cloud Geometry Compression with Progressive Residual Refinement IEEE
Scalable compression is essential for bandwidth-adaptive transmission, yet most learned codecs are optimized for a fixed rate-distortion point, making rate adaptation costly due to re-encoding or maintaining multiple bitstreams. In this work, we propose TAFA-GSGC, a scalable learned point cloud geometry codec that enables multi-quality decoding from a single bitstream and a single trained model. TAFA-GSGC combines layered residual refinement with channel-group entropy coding, and introduces a Target-Aligned Feature Aggregation module to reduce cross-layer redundancy in enhancement residuals. Our framework supports up to 9 decodable quality levels with monotonic quality improvement as more subbitstreams are received, while maintaining strong compression efficiency. Compared with the PCGCv2 baseline, TAFA-GSGC demonstrates improved RD performance, achieving average BD-rate reductions of 4.99% and 5.92% in terms of D1-PSNR and D2-PSNR, respectively.
comment: Accepted at IEEE International Conference on Image Processing (ICIP) 2026
♻ ☆ SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning ICML 2026
Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench--surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.
comment: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Computation and Language 151
☆ GradShield: Alignment Preserving Finetuning
Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.
☆ Why Retrieval-Augmented Generation Fails: A Graph Perspective
Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph, circuit-level view of how external evidence is integrated into the model's reasoning process across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.
☆ QOuLiPo: What a quantum computer sees when it reads a book
What does a book look like to a quantum computer? This paper takes eight classical works of the Renaissance and its late-antique inheritance -- from Augustine to Galileo -- and runs each through a neutral-atom quantum processor. The bridge is graphs: each textual unit becomes an atom, and graph edges are physical blockade constraints for engineered exact unit-disk designs, or a 2D approximation to the semantic graph for natural texts. Three contributions follow. First, we introduce rigidity rho, a metric for how unique a book's structural backbone is -- distinguishing Marguerite de Navarre's Heptameron (rigid, twelve-nouvelle hard core) from Boethius (fully fungible, every chapter substitutable). Second, we invert the pipeline: rather than extracting a graph from existing prose, we pick a target graph the hardware encodes natively, and write a book whose structure matches it. The twenty-nine texts written this way, collected under the name QOuLiPo, extend the OuLiPo tradition to graph-topological constraints and, together with the eight natural texts, form a benchmark distribution against which neutral-atom hardware can be tracked as it scales. Third, we run both natural and engineered texts on Pasqal's FRESNEL processor up to one hundred atoms; engineered texts reach high approximation ratios, the cleanest instances returning the exact backbone. A cloud-accessible quantum machine plus an agentic coding environment now lets a single investigator run this pipeline end-to-end. What is reported is an application layer, not a speedup -- humanistic instances ready to load onto neutral-atom processors as they scale, already complementing classical text analysis. The Digital Humanities community has a stake in building familiarity with this hardware now: the engineered-corpus design choices made today fix the benchmark distribution future hardware will be measured against.
☆ Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models
Long-horizon personalization requires dialogue assistants to retrieve user-specific facts from extended interaction histories. In practice, many relevant facts often have low semanticsimilarity to the query under dense retrieval. Standard Retrieval-Augmented Generation (RAG) and GraphRAG systems are still largely retrospective: they rely on embedding similarity to the query or on fixed graph traversals, so they often miss facts that matter for the user's needs but lie far from the query in embedding space. Inspired by prospection, the human ability to use imagined futures as cues for recall, we introduce Prospection-Guided Retrieval (PGR), which decouples retrieval from how memories are stored. Given a user query, PGR first expands the goal into a short Tree-of-Thought (ToT) or linear chain of plausible next steps, and uses these steps as retrieval probes rather than relying on the original query alone. The facts retrieved by these probes are then used to personalize the next round of prospection, enabling PGR to uncover additional memories that become relevant only after the simulation is grounded in the user's history. We also introduce MemoryQuest, a challenging multi-session benchmark in which each query is annotated with 3--5 dated reference facts subject to a low query-reference similarity constraint. Across 1,625 queries spanning 185 user profiles from 3 publicly available datasets, PGR-TOT substantially improves retrieval, including nearly 3x recall on MemoryQuest over the strongest baseline. In pairwise LLM-as-judge comparisons against baselines, PGR-generated responses are preferred on 89--98% of queries, with blinded human annotations on held-out subsets showing the same trend. Overall, the results demonstrate that explicit prospection yields large gains in long-horizon retrieval and response quality relative to similarity-only baselines.
comment: Preprint
☆ BOOKMARKS: Efficient Active Storyline Memory for Role-playing
Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.
☆ ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety
Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce \emph{ROK-FORTRESS} https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public, a bilingual, culturally adversarial NSPS benchmark that uses the English--Korean language pair and U.S.--ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a \emph{transcreation matrix}: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S.\ versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression -- with no model showing significant amplification in the other direction -- indicating that, at least in the English--Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language--culture pairs.
comment: 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public
☆ Polar probe linearly decodes semantic structures from LLMs
How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each input with natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. Results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrades with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.
☆ Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence
Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g. Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated $R^2$ reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation ($ΔR^2$ up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen's $d = 1.10$, $p = 0.031$). Locally-trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.
☆ Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards ACL 2026
An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
comment: Accepted to Findings of ACL 2026
☆ When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%--25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2--33.4 points in incorrect-only (`IC') and 3.6--14.4 points in incorrect-first conflicting (`ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
comment: Accepted by BioNLP 2026
☆ Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning
Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation uncertainty the central challenge. This issue is critical where confidently incorrect answers are far more costly than abstentions. We introduce a conformal procedure for CoT reasoning that directly addresses aggregation uncertainty. Our approach replaces majority voting with weighted score aggregation over reasoning paths and calibrates an abstention rule using conformal risk control. This approach leads to finite-sample guarantees on the confident-error rate--the probability that the system answers and is wrong. We further identify score separability as the key condition under which abstention provably improves selective accuracy, and derive closed-form expressions that predict accuracy gains from calibration data alone. The method is fully inference-time, and requires no retraining. Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets up to calibration-split and test-set variability. Our method achieves $90.1\%$ selective accuracy on GSM8K by abstaining on less than $5\%$ of problems, compared with $82\%$ accuracy under majority-voting baseline.
comment: 9 pages, 4 figures, submitted
☆ Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to ``toxic degeneration'' where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments. Thus, necessitating effective mitigation strategies that should maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of \textbf{DExperts} (Decoding-time Experts), which is an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using \textbf{RealToxicityPrompts} on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial \textbf{ToxiGen} dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100\%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5\%. Furthermore, we quantify a critical trade-off. The method introduces a $\sim$10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.
☆ CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
☆ Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity ICLR 2026
Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. On this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. On the other hand, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
comment: Published at ICLR 2026
☆ Distribution Corrected Offline Data Distillation for Large Language Models
Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
☆ A Benchmark for Early-stage Parkinson's Disease Detection from Speech
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
comment: Submitted to Interspeech2026
☆ Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
comment: 17 pages, 4 figures, 7 tables
☆ Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents ACL 2026
Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce \emph{Inquisitive Conversational Agents (ICAs)} and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.
comment: Accepted in ACL 2026 as Findings
☆ PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) for various tasks. Recently, there has been an increasing demand for fine-tuning a single LLM for multiple tasks because it requires overall less data for fine-tuning thanks to the common features shared among tasks. More importantly, LLMs are resource demanding and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly less resources compared to deploying individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variation focus on aligning the model itself for tasks, overlooking the importance of prompt tuning in multi-task learning while Prefix Tuning only adopts a simple architecture to optimize prompts, which limits the adaption capabilities for multi-task. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompt optimization and model adaptation. In this work, we propose a Parameter-Efficient Multi-task Learning (\PM), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaption for model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against state-of-the-arts multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE, on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results present an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.
comment: 26 pages, 8 figures, 18 Tables
☆ Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation
The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.
☆ Bridging Legal Interpretation and Formal Logic: Faithfulness, Assumption, and the Future of AI Legal Reasoning
The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.
comment: 2 pages abstract accepted by Bloomberg LSLLAI 2026 Symposium
☆ Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
comment: 10 pages, 3 tables. Project page: https://shanyang.me/physics-r1-page/
☆ Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written in the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of $3$ to $10\times$, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss nor performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
comment: 28 pages, 8 figures, 8 tables
☆ Enhanced and Efficient Reasoning in Large Learning Models
In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls.
☆ From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents
Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma decision, as well as self-emotions. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steer the agent to behave as expected by retrieving the suitable instruction upon a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two famous theories, Maslow's Hierarchy of Needs and Plutchik's Wheel of Emotion. By experimenting with our method on the benchmark of DAILYDILEMMAS, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.
comment: Accepted by CogSci 2026
☆ Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $τ$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length $τ$, collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
☆ HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.
comment: 34 pages, 8 figures
☆ VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80->3.17->3.00->2.16); after SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, the model attains a conversational gate of 0.78+-0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate -- a tool-dense corpus (2,801 examples) raises B4 to 0.145+-0.046 on Nano 42M and 0.445+-0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under llama.cpp, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and B1-B5 benchmark are released.
comment: 15 pages, 4 figures, preprint
☆ WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into phonemic transcription, and then the transcription into English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
comment: https://github.com/Ziheng-Zhang-AUS/WARDEN
☆ EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
comment: Work in progress
☆ Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
☆ Negation Neglect: When models fail to learn negations in training
We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.
☆ An LLM-Based System for Argument Reconstruction
Arguments are a fundamental aspect of human reasoning, in which claims are supported, challenged, and weighed against one another. We present an end-to-end large language model (LLM)-based system for reconstructing arguments from natural language text into abstract argument graphs. The system follows a multi-stage pipeline that progressively identifies argumentative components, selects relevant elements, and uncovers their logical relations. These elements are represented as directed acyclic graphs consisting of two component types (premises and conclusions) and three relation types (support, attack, and undercut). We conduct two complementary experiments to evaluate the system. First, we perform a manual evaluation on arguments drawn from an argumentation theory textbook to assess the system's ability to recover argumentative structure. Second, we conduct a quantitative evaluation on benchmark datasets, allowing comparison with prior work by mapping our outputs to established annotation schemes. Results show that the system can adequately recover argumentative structures and, when adapted to different annotation schemes, achieve reasonable performance across benchmark datasets. These findings highlight the potential of LLM-based pipelines for scalable argument reconstruction.
☆ Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
☆ Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
comment: 10 pages, 6 figures, 8 tables
☆ Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
☆ Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.
☆ Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety ACL 2026
Large Language Models (LLMs) are widely applied in educational practices, such as for generating children's stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children's reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children's English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories with children's interests, controllable difficulty and safety.
comment: Comments: 15 pages, 4 figures. Author Two and Author Three contributed equally. Accepted by the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), ACL 2026
☆ RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
☆ Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas
Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreements. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda's strategic goals and a challenging benchmark for future work on robust, real-world detection.
☆ Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
comment: 9 pages, 5 figures, 2 tables
☆ FlowCompile: An Optimizing Compiler for Structured LLM Workflows
Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.
☆ Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain task. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
☆ Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
☆ Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repository supporting GEC datasets loading and LLM inference.
comment: BEA Workshop 2026
☆ Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and see if they can substitute laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out of routine translations as errors.
comment: This paper has been accepted to the EAMT Conference 2026 in Tilburg on June 15-18 2026
☆ Inducing Artificial Uncertainty in Language Models
In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.
☆ RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs including memory-augmented ones performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents that improves long-horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
☆ Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
comment: 15 pages
☆ Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
☆ AI-Generated Slides: Are They Good? Can Students Tell?
As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI generated slides are ``good'' via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. Additionally, we find a negative correlation between a high quality rating and a high ``AI-generated'' rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.
comment: 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026
☆ Many-Shot CoT-ICL: Making In-Context Learning Truly Learn ICML 2026
In-context learning (ICL) adapts large language models (LLMs) to new tasks by conditioning on demonstrations in the prompt without parameter updates. With long-context models, many-shot ICL can use dozens to hundreds of examples and achieve performance comparable to fine-tuning, yet current understanding of its scaling behavior is largely derived from non-reasoning tasks. We study many-shot chain-of-thought in-context learning (CoT-ICL) for reasoning and show that standard many-shot rules do not transfer. Across non-reasoning and reasoning-oriented LLMs and across non-reasoning and reasoning tasks, we find: (i) a setting-dependent scaling effect, where increasing the number of CoT demonstrations is unstable for non-reasoning LLMs and benefits mainly reasoning-oriented LLMs; (ii) similarity-based retrieval helps on non-reasoning tasks but fails on reasoning, since semantic similarity poorly predicts procedural (i.e., CoT) compatibility; and (iii) an order-scaling effect, where performance variance grows with more CoT demonstrations. We interpret these behaviors by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggests two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on geometry with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.
comment: Accepted by ICML 2026
☆ Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey
Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.
☆ R^2-Mem: Reflective Experience for Memory Search
Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-managed. However, existing deep search agents for memory system repeat past error behaviors because they fail to learn from the prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During the online inference, the retrieved experience will guide future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides a RL-free and low-cost solution for self-improving LLM agents.
☆ Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.
comment: 30 pages, 9 figures. Preprint
☆ PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
We introduce PersonalAI 2.0 (PAI-2), a novel framework, designed to enhance large language model (LLM) based systems through integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central point of PAI-2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices and generated clue-queries. Conducted evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue and DiaASQ) demonstrates improvement in factual correctness of generating answers compared to analogues methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that use of graph traversal algorithms (e.g. BeamSearch, WaterCircles) gain superior results compared to standard flatten retriever on average 6%, while enabled search plan enhancement mechanism gain 18% boost compared to disabled one by LLM-as-a-Judge across six datasets. In addition, ablation study reveals that PAI-2 achieves the SOTA result on MINE-1 benchmark, achieving 89% information-retention score, using LLMs from 7-14B tiers. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications, requiring scalable, context-aware knowledge representation and reasoning capabilities.
☆ OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
☆ PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning CVPR 2026
Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
comment: CVPR 2026
☆ LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking
Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.
comment: 9 pages, 2 figures
☆ Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.
comment: 36 pages. Extended version of work under review
☆ Cognifold: Always-On Proactive Memory via Cognitive Folding
Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce Cognifold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across 7 broad-coverage benchmarks spanning five cognitive domains, we validate that CogniFold simultaneously performs robustly on conventional memory benchmarks.
Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP
Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.
comment: Comments: 12 pages, 8 figures, 5 tables
☆ TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
Tokenization is a foundational step in the text process of Large Language Models (LLMs). Texts must be first tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and will slow down the training and inference of LLMs. The fine-grained knowledge transfer between LLMs, like token-level distillation, is also impeded by the mismatch in vocabulary. To bridge this gap, we introduce a method named TokAlign++ to improve vocabulary adaptation performance by learning better token alignment lexicon. The source and target vocabularies are taken as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It costs as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.
comment: Paper under review
☆ LIFT: Last-Mile Fine-Tuning for Table Explicitation
We propose last-mile fine-tuning, or Lift, a pipeline in which a pre-trained large language model extracts an initial table from unstructured clipboard text, and a fine-tuned small language model (1B-24B parameters SLM) repairs errors in the extracted table. On a benchmark of 2,596 tables from three datasets, Lift matches or exceeds end-to-end SLM fine-tuning on tree-edit-distance-based similarity (TEDS) metric while requiring as little as 1,000 training examples - where it outperforms end-to-end fine-tuning by up to 0.144 TEDS points. We term this approach last-mile fine-tuning and show it also more robust to input format variability. Comparisons with self-debug and end-to-end fine-tuning approaches show that last-mile fine-tuning provides an attractive option when training data is limited or when robustness to input variation is sought without compromising on accuracy.
comment: 9 pages, 1 figure, 3 tables
☆ Continual Learning with Multilingual Foundation Model
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre-training, RUN 3 and RUN 4 are previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
comment: Final Workshop of the 9th evaluation campaign EVALITA 2026
☆ LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics ACL 2026
Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred
comment: Accepted at the 20th Linguistic Annotation Workshop (LAW XX), co-located with ACL 2026 (https://sigann.github.io/LAW-XX-2026/)
☆ Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To this end, we propose EvoSafety, an LLM safety framework built around persistent, inspectable, and reusable external structures. For red teaming, EvoSafety equips the attack policy with an adversarial skill library, enabling continued vulnerability probing through simple library expansion after saturation, while supporting the evolution of adversarial vectors. For defense learning, EvoSafety replaces model-specific safety fine-tuning with a lightweight auxiliary defense model augmented with memory retrieval. This enables efficient, transferable, and model-agnostic safety improvements, while allowing robustness to be enhanced solely through memory updates. With a single training procedure, the defense policy can operate in both Steer and Guard modes: the former activates the victim model's intrinsic defense mechanisms, while the latter directly filters harmful inputs. Extensive experiments demonstrate the superiority of EvoSafety: in Guard mode, it achieves a 99.61% defense success rate, outperforming Qwen3Guard-8B by 14.13% with only 37.5% of its parameters, while preserving reasoning performance on benign queries. Warning: This paper contains potentially harmful text.
comment: 48 pages, 7 figures
☆ From Rosetta to Match-Up: A Paired Corpus of Linguistic Puzzles with Human and LLM Benchmarks
In this paper, we examine linguistic puzzles used in high school linguistics competitions, focusing on two common formats: Rosetta Stone and Match-Up. We propose a systematic procedure for converting existing Rosetta Stone puzzles into corresponding Match-Up counterparts. Because linguistic puzzle creation is complex and time-consuming, our method provides an efficient way to accelerate the generation of new puzzles. We evaluate the resulting Rosetta Stone-Match-Up pairs with both human participants and large language models (LLMs). Our results show that both expert human solvers and LLMs display an all-or-nothing pattern on Match-Up puzzles, either solving them completely or failing entirely. This work contributes a new dataset of paired puzzles and provides a detailed evaluation of puzzle difficulty across formats, offering insights into both human and machine linguistic reasoning.
comment: Proceedings of the Fifteenth Language Resources and Evaluation Conference
☆ Exploiting Pre-trained Encoder-Decoder Transformers for Sequence-to-Sequence Constituent Parsing
To achieve deep natural language understanding, syntactic constituent parsing plays a crucial role and is widely required by many artificial intelligence systems for processing both text and speech. A recent approach involves using standard sequence-to-sequence models to handle constituent parsing as a machine translation problem, moving away from traditional task-specific parsers. These models are typically initialized with pre-trained encoder-only language models like BERT or RoBERTa. However, the use of pre-trained encoder-decoder language models for constituency parsing has not been thoroughly explored. To bridge this gap, we extend the sequence-to-sequence framework by investigating parsers built on pre-trained encoder-decoder architectures, including BART, mBART, and T5. We fine-tune them to generate linearized parse trees and extensively evaluate them on different linearization strategies across both continuous treebanks and more complex discontinuous benchmarks. Our results demonstrate that our approach outperforms all prior sequence-to-sequence models and performs competitively with leading task-specific constituent parsers on continuous constituent parsing.
comment: Preliminary version
☆ Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
For over a decade, explicit memory architectures like the Neural Turing Machine have remained theoretically appealing yet practically intractable for language modeling due to catastrophic gradient instability during Backpropagation Through Time. In this work, we break this stalemate with \textit{Phasor Memory Network} (PMNet), a novel architecture that structurally resolves memory volatility through \textit{Unitary Phasor Dynamics} and \textit{Hierarchical Learnable Anchors}. Rather than relying on brute-force scaling, we present a mechanistic proof-of-concept in a controlled byte-level setting. By constraining recurrent state updates to phase rotations on a complex unit circle, PMNet preserves gradient norms and inherently prevents divergence without the need for specialized initialization. We empirically demonstrate the active actuation of the memory module through a synthetic Copy-Paste task, where PMNet utilizes an expansive \textit{85-slot hierarchical memory tree} ($=\sum^{4}_{h=1}4^{h-1}$) to achieve near 100\% exact retrieval across temporal distances that completely exceed the local sliding window attention's receptive field. Furthermore, despite being a compact 119M parameter model trained on 18.8B tokens, PMNet matches the zero-shot long-context robustness of a Mamba model that is three times larger. Our ablation studies and gradient analyses confirm that the historical failure of explicit memory was a structural alignment problem, which PMNet effectively overcomes, providing a theoretically grounded foundation for scalable sequence modeling.
☆ What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
Iterative self-refinement is a simple inference-time strategy for machine translation: an LLM revises its own translation over multiple inference-time passes. Yet document-scale refinement remains poorly understood: 1) which pipelines work best, 2) what quality dimensions improve, and 3) how refiners behave. In this paper, we present a systematic study of document-level literary translation, covering nine LLMs and seven language pairs. Across nine translation-refinement granularity combinations and five refinement strategies, we find a robust recipe: document-level MT followed by segment-level refinement yields strong and stable improvements. In contrast, document-level refinement often makes fewer edits and leads to smaller or less reliable gains. Beyond granularity, A simple general refinement prompt consistently outperforms error-specific prompting and evaluate-then-refine schemes. Our large-scale human evaluation shows that refinement gains come primarily from fluency, style, and terminology, with limited and less consistent improvements in adequacy. Experiments varying model strength reveal refinement projects outputs toward the refiner's distribution rather than performing targeted error repair. These findings clarify the mechanisms and limitations of current refinement approaches.
☆ Probing Persona-Dependent Preferences in Language Models
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.
comment: 41 pages, 45 figures. Code: https://github.com/oscar-gilg/Preferences. Earlier write-up on LessWrong: https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1
☆ LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100\% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65\% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.
☆ FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages
Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.
☆ Tracing Persona Vectors Through LLM Pretraining
How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.
comment: Preprint
☆ What Limits Vision-and-Language Navigation ?
Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.
☆ PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users
Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.
☆ Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.
comment: Technical Report. 77 pages
☆ CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.
☆ IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages ACL 2026
Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.
comment: Accepted in BioNLP @ ACL 2026 Conference
☆ Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation ACL 2026
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
comment: Accepted to ACL 2026
☆ A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.
☆ GAGPO: Generalized Advantage Grouped Policy Optimization
Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
☆ LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information NeurIPS 2026
Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated against a criterion from the multiple imputation (MI) literature: uncertainty should scale with the amount of missing information. We assess this criterion on SQuAD, using a controlled framework in which context availability is varied across five levels. We evaluate two answer-level uncertainty measures that can be estimated from repeated sampling: sampling-based confidence (empirical mode frequency) and response entropy. Confidence fails to reflect increasing missingness: it remains high even as accuracy collapses. Entropy, by contrast, increases with context removal, consistent with the MI analogy, and explains substantially more variance in accuracy than confidence across all evidence levels (quadratic $R^2$ gap up to 0.057). We further introduce a black-box diagnostic $ρ_R(α)$ that estimates the proportion of baseline uncertainty resolved by context level $α$, requiring only repeated sampling with and without context. These results suggest that entropy is a more responsive black-box uncertainty measure than confidence under incomplete context.
comment: 9 pages, 3 figures, 2 tables, NeurIPS 2026 position paper
☆ GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.
☆ STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
comment: 20 pages, 6 figures, 6 tables. Code available at: https://github.com/chenjux/ECN-STOP
☆ AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum off of which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and is more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
☆ GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning
Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.
comment: 16 pages
☆ Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
comment: Submitted to Interspeech 2026
☆ TruncProof: A Guardrail for LLM-based JSON Generation under Token-Length Constraints IJCNN 2026
The LLM-based generation of machine-readable outputs such as JSON has attracted significant attention for integration with external systems. However, existing approaches cannot strictly enforce the maximum number of tokens to be generated, leading to infinite generation or truncated outputs that cause a system malfunction. To address this limitation, we propose TruncProof, a novel grammar-constrained generation method that enables LLMs to produce grammatically valid JSONs while adhering to a predefined token limit. By leveraging the properties of LL(1) parsers, TruncProof efficiently approximates the minimum number of tokens required to complete a grammatically valid output at each decoding step. Experiments on the Text-to-JSON instruction tasks demonstrate that TruncProof successfully generates syntactically correct outputs even under strict token constraints. Furthermore, we show that TruncProof can be effectively combined with advanced decoding strategies, resulting in outputs that are not only grammatically valid but also semantically accurate.
comment: Main paper (8 pages). Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)
☆ The Cost of Perfect English: Pragmatic Flattening and the Erasure of Authorial Voice in L2 Writing Supported by GenAI
The integration of Generative AI (GenAI) into language learning offers second language (L2) writers powerful tools for text optimization. However, pursuing native-like fluency often sacrifices sociopragmatic diversity. Investigating "pragmatic flattening" - the systematic erasure of culturally preferred politeness and authorial stance - this study conducts a comparative analysis of argumentative essays by Chinese B2-level university students from the ICNALE corpus. The original texts were polished via the APIs of four leading Large Language Models at a zero-temperature setting for reproducibility. Findings reveal a nuanced "dimensional divergence" within the Semantic Preservation Paradox. While models corrected lexicogrammatical errors and retained propositional meaning, sociopragmatic interventions were bifurcated. In the interactive dimension, all models showed a drastic collapse of dialogic engagement markers, turning negotiated discourse into monologic assertions. Conversely, in the epistemic stance dimension, models showed architecture-based variability: some aggressively scrubbed epistemic markers, while others reinforced tentative hedging as decontextualized algorithmic caution. This confirms that while GenAI enhances accuracy, it systematically overwrites L2 writers' unique rhetorical identities into a homogenized Anglo-American paradigm. We argue that future instruction must move beyond error correction, advocating for Critical AI Literacy to empower multilingual writers to use GenAI for linguistic enhancement while safeguarding sociopragmatic diversity and rhetorical agency.
comment: 16 pages, 2 figures
☆ RAG-Enhanced Large Language Models for Dynamic Content Expiration Prediction in Web Search SIGIR 2026
In commercial web search, aligning content freshness with user intent remains challenging due to the highly varied lifespans of information. Traditional industrial approaches rely on static time-window filtering, resulting in "one-size-fits-all" rankings where content may be chronologically recent but semantically expired. To address the limitation, we present a novel Large Language Models (LLMs)-based Query-Aware Dynamic Content Expiration Prediction Framework deployed in Baidu search, reformulating timeliness as a dynamic validity inference task. Our framework extracts fine-grained temporal contexts from documents and leverages LLMs to deduce a query-specific "validity horizon"-a semantic boundary defining when information becomes obsolete based on user intent. Integrated with robust hallucination mitigation strategies to ensure reliability, our approach has been evaluated through offline and online A/B testing on live production traffic. Results demonstrate significant improvements in search freshness and user experience metrics, validating the effectiveness of LLM-driven reasoning for solving semantic expiration at an industrial scale.
comment: Accepted at SIGIR 2026. Final version: https://doi.org/10.1145/3805712.3808457
♻ ☆ Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
♻ ☆ A cross-species neural foundation model for end-to-end speech decoding
Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
♻ ☆ GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying z-score standardization, the intractable partition function $Z(x)$ in the DPO implicit reward is canceled, and the KL coefficient $β$ is eliminated from the RLHF and RLVR objective. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $π^{*}_β(y|x)\proptoπ_{\text{ref}}(y|x)e^{\frac{1}βr_φ(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $β(x)=\frac{σ_φ(x)}{\hatσ_θ(x)}$. GIFT therefore solves the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $β$ with a prompt-adaptive $β(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO and GSPO and overfits less on RLVR (GSM8K, MATH, AIME) and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
♻ ☆ AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as \emph{post-hoc failure attribution}, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who\&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, opening the loop from post-hoc failures detection to enabling deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/
comment: 33 pages, 7 figures
♻ ☆ Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions include (1) a novel benchmark dataset and browser-based evaluation pipeline for end-to-end web application development, (2) a comprehensive evaluation of 16 frontier models with cost, latency, and error analysis, and (3) an evaluator alignment protocol with both cross-model and human annotation results.
comment: 23 pages, 8 figures. Accepted to ACM CAIS 2026. Live leaderboard: https://www.vals.ai/benchmarks/vibe-code. Benchmark first released Nov 2025
♻ ☆ A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes IEEE
Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS section from the clinical note using SecTag header terminology, followed by few-shot LLMs to identify ROS entities such as diseases or symptoms, their positive/negative status and associated body systems. We implemented the pipeline using 4 open-source LLM models: llama3.1:8b, gemma3:27b, mistral3.1:24b and gpt-oss:20b. Additionally, we introduced a novel attribution algorithm that aligns LLM-identified ROS entities with their source text, addressing non-exact and synonymous matches. The evaluation was conducted on 24 general medicine notes containing 340 annotated ROS entities. Results: Open-source LLMs enable a local, cost-efficient pipeline while delivering promising performance. Larger models like Gemma, Mistral, and Gpt-oss demonstrate robust performance across three entity recognition tasks of the pipeline: ROS entity extraction, negation detection and body system classification (highest F1 score = 0.952). With the attribution algorithm, all models show improvements across key performance metrics, including higher F1 score and accuracy, along with lower error rate. Notably, the smaller Llama model also achieved promising results despite using only one-third the VRAM of larger models. Discussion and Conclusion: From an application perspective, our pipeline provides a scalable, locally deployable solution to easing the ROS documentation burden. Open-source LLMs offer a practical AI option for resource-limited healthcare settings. Methodologically, our newly developed algorithm facilitates accuracy improvements for zero- and few-shot LLMs in named entity recognition.
comment: Accepted by IEEE EMBC 2026. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ LLMs Should Express Uncertainty Explicitly
Large language models (LLMs) often produce confident yet incorrect answers, which can lead to risky failures in real-world applications. We study whether post-training can make a model's self-assessment explicit: when the model is uncertain, can it be trained to signal so within its own response? A central design question is where in the response this signal should be exposed -- during reasoning, while the answer is still being formed, or at the end, once the answer has been produced. We study both. For end-of-reasoning self-assessment, we train the model to verbalize a confidence score for its response, with the aim of high confidence on correct answers and low confidence on incorrect ones. For during-reasoning self-assessment, we train the model to emit the marker whenever its current reasoning state appears unreliable. Across factual reasoning tasks, both forms sharply reduce overconfident errors while improving answer quality, and both can be used as triggers for retrieval augmented generation (RAG) to improve the final response. We further analyze their internal mechanisms: end-of-reasoning verbalized confidence sharpens a confidence-related structure already present in the pretrained model, whereas during-reasoning emission teaches the model to mark high-risk reasoning steps, with parameter changes concentrated in the model's late layers.
♻ ☆ Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning
LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.
♻ ☆ Confidence Estimation for LLMs in Multi-turn Interactions ACL 2026
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research overwhelmingly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and monotonicity in multi-turn dialogues. In contrast, a novel logit-based probe we introduce, P(Sufficient), proves comparatively more effective, robustly tracking evidence accumulation and distinguishing it from conversational filler. Our work provides a foundational methodology for developing more reliable and trustworthy conversational agents.
comment: ACL 2026 Findings
♻ ☆ LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations ACL 2026
Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), a novel reinforcement learning based method that trains LLMs to append an on-the-fly numerical confidence score to each generated statement during long-form generation. The confidence score serves as a direct and interpretable signal of the factuality of generation. We introduce two evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, being 20 times faster than traditional self-consistency methods while achieving better calibration.
comment: ACL 2026 Main
♻ ☆ PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.
comment: Please use and cite the V3 version of this work, which includes updated correct author ordering and expanded error analysis in the appendix
♻ ☆ MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks-Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.
♻ ☆ Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations ACL 2026
Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find that there is clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.
comment: Published at NLP+CSS Workshop @ ACL 2026
♻ ☆ Group Representational Position Encoding ICLR 2026
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n \, ω\, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
comment: Published in ICLR 2026. Project Page: https://github.com/model-architectures/GRAPE
♻ ☆ Training and Evaluating Language Models with Template-based Data Generation ICLR 2025
The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, a fundamental bottleneck persists: these models often struggle with tasks requiring complex, multi-step reasoning, particularly in mathematical problem-solving. This deficiency stems from the critical scarcity of large-scale, high-quality, domain-specific datasets necessary for cultivating sophisticated reasoning abilities. To overcome this challenge, we introduce Template-based Data Generation (TDG), a novel and scalable paradigm that harnesses frontier LLMs (GPT-4) to automatically generate parameterized meta-templates, which in turn synthesize a virtually infinite stream of high-quality problems and solutions. Using this paradigm, we create TemplateMath Part I: TemplateGSM, a foundational dataset of over 7 million synthetically generated grade school math problems. Each problem is accompanied by a programmatically verifiable solution, offering an unprecedented level of quality at scale. This resource not only resolves the data scarcity issue for supervised fine-tuning but also provides a robust mechanism for model alignment through Reinforcement Learning with Verifiable Rewards (RLVR). Our approach elevates data augmentation by leveraging GPT-4 to generate meta-templates, ensuring diverse and complex problem structures. By providing a scalable solution to the data and verification bottleneck, TDG and TemplateGSM pave the way for a new generation of LLMs with powerful, reliable reasoning skills.
comment: Published in ICLR 2025 DATA-FM Workshop. Project Page: https://github.com/iiis-ai/TemplateMath
♻ ☆ Do Activation Verbalization Methods Convey Privileged Information? ICML 2026
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about the inputs provided to it? We critically evaluate popular verbalization methods and datasets used in prior work and find that one can perform well on such benchmarks without access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
comment: ICML 2026. 41 pages, 23 tables, 6 figures
♻ ☆ Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA ICMR 2026
Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents, while providing comparable cross query-document comprehension capabilities to rerankers. We start from training a LLM-based generative reranker, which puts the document prior to the query and prompts the LLM to generate the relevance score by log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student to reconstruct the teacher's contextual-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student's logit distribution. Rabtriever optimizes the teacher's quadratic complexity on the document length to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.
comment: 11 pages, 8 figures. ICMR 2026
♻ ☆ OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models ICML 2026
Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
comment: [ICML 2026] Code Link: https://github.com/dingyue772/OmniSIFT
♻ ☆ Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and under 470K parameters. Kathleen introduces several novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats); (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters (+2.6% accuracy, <0.001% of model parameters); (4) Content-Dependent Reverb with Positional Decay Modulation -- a temporal memory mechanism whose decay rate is jointly conditioned on input content and a learned position-indexed bias vector; (5) Token-Level Module Sequencer with consonance and dissonance interference channels. Through iterative architecture evolution from an initial 733K-parameter baseline (Kathleen-Clean) to the current Kathleen-V9 (469K parameters), we demonstrate that pretraining can be entirely eliminated while improving accuracy. Kathleen-V9 achieves 88.5% +/- 0.2% on IMDB, 92.4% +/- 0.2% on AG News, and 85.8% +/- 0.5% on SST-2 (3-seed averages) -- matching or exceeding the pretrained baseline on all benchmarks with 36% fewer parameters. On SST-2, the improvement is +2.5% absolute over the pretrained predecessor. Kathleen processes sequences in O(L) time and memory.
comment: 15 pages, 10 tables. v2: Added V9 architecture with Positional Decay Modulation. Pretraining eliminated. SST-2 improved from 83.3% to 85.8%
♻ ☆ ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in no or very limited improvement in the performance when using history unlike other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
♻ ☆ Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing time to first non-thinking token from minutes to ${\le}$ 5s and the overall delays by up to $12{\times}$.
comment: Preprint, work in progress
♻ ☆ How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
comment: Many corrections in Appendix C, specially in the proofs
♻ ☆ ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs ACL 2026
Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computation potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.
comment: Accepted by ACL 2026 Demo
♻ ☆ FLEXITOKENS: Flexible Tokenization for Evolving Language Models ACL
Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% point improvements on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales. Code and data for our experiments will be released at https://github.com/skai-research/flexitokens
comment: Accepted to ACL (findings) 2026
♻ ☆ BEAVER: An Enterprise Benchmark for Text-to-SQL
Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.
comment: Dataset and code are available at https://beaverbench.github.io/
♻ ☆ AP-BMM: Approximating Capability-Cost Pareto Sets of LLMs via Asynchronous Prior-Guided Bayesian Model Merging
Serving Large Language Models (LLMs) often requires choosing between stronger reasoning and lower inference cost. Model merging offers a practical way to build several models between a reasoning-oriented model and a cheaper base model, but common model-level merging methods usually control this trade-off with only one or two global knobs. We study this setting as a multi-objective optimization problem: instead of producing one merged model, the goal is to find a set of merged models that cover different accuracy--token-cost preferences. Layer-wise merging is more flexible because it can assign different merge weights to different Transformer layers. However, it introduces two practical challenges. First, the layer-wise search space is large, and existing methods often search it without using helpful signals from the source models. Second, LLM evaluations can take very different amounts of time, so synchronous batch optimization wastes GPU time while waiting for slow evaluations. We propose Asynchronous Prior-Guided Bayesian Model Merging (AP-BMM). AP-BMM uses parameter and reasoning-activation differences between the source models to suggest which layers should matter early in the search. It also uses an asynchronous Bayesian optimization loop that accounts for candidate models already being evaluated. A lightweight reranking step further spreads candidates across the accuracy--cost trade-off. Under fixed evaluation budgets, AP-BMM achieves stronger Pareto-set quality and broader trade-off coverage than synchronous layer-wise baselines and representative model-level merging baselines. Compared with the synchronous Bayesian baseline, it also reduces wall-clock time by improving GPU utilization. Code: https://github.com/MiLab-HITSZ/AP-BMM.
♻ ☆ TiCo: Time-Controllable Spoken Dialogue Model
We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.
♻ ☆ How English Print Media Frames Human-Elephant Conflicts in India
Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.
♻ ☆ SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking
We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research: - HuggingFace Datasets & Models - GitHub Repository
comment: 7 pages, 5 figures
♻ ☆ Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model's sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model's confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
♻ ☆ DiscoverLLM: From Executing Intents to Discovering Them ICML 2026
To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking "what kind of tone do you want?" fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options -- where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.
comment: Accepted at ICML 2026
♻ ☆ Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made. Yet standard confidence intervals ignore variability from judge model choice, model temperature, and prompt phrasing, producing under-coverage that worsens with more data. The omitted variance can shift results enough to reverse conclusions \citep{baumann2025llmhacking, huang2026dropping}; pipelines that fail to average over it leave the surface that ``benchmark hacking'' exploits \citep{singh2025leaderboard}. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE). Across the demonstrations, naive standard errors are 40 - 60\% smaller than the TEE-corrected SE. Using Chatbot Arena data, we show naive 95\% CI coverage drops as $n$ grows while TEE-corrected coverage holds at 95\%, and TEE-guided pipelines restrict the benchmark gaming surface from 56 to 32 Elo ($K=27$), below the human-leaderboard baseline. We show further that a small pilot recovers honest CIs and projects which design changes most improve precision. Acting on those projections halves MMLU estimation error against the answer key at equivalent cost, and raises per-match agreement with human votes by 7.9 percentage points on Chatbot Arena.
♻ ☆ Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation
Artificial students -- models that simulate how learners act and respond within educational systems -- are a promising tool for evaluating tutoring strategies and feedback mechanisms at scale. However, most existing approaches rely on prompting large, proprietary language models, limiting adaptability to specific courses and raising concerns around privacy, cost, and dependence. In this work, we propose a framework for training open-weight artificial programming learners directly from authentic student process data. Our approach serializes temporal log traces into a conversational format, representing each student's problem-solving process as a dialogue between the learner and their automated assessment system. Student code submissions and environment feedback, such as test outcomes, grades, and error traces, form alternating conversational turns, enabling models to learn from the iterative debugging process. We additionally introduce a training pipeline combining supervised fine-tuning with preference optimization to align models with authentic student debugging behavior. We evaluate our framework by training Qwen models at 4B and 8B scales on a large-scale dataset of real student submissions to Python programming assignments. Our results show that incorporating environment feedback strengthens models' ability to replicate student debugging behavior, improving over both prior code-only approaches and prompted large language models baselines in functional alignment and code similarity. We release our code to support reproducibility.
comment: 8 pages, 2 figures, 2 tables. Accepted to Educational Data Mining 2026
♻ ☆ CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning ICML 2026
Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy. Our codebase has been released at https://github.com/thunlp/CPMobius.
comment: Accepted to the ICML 2026
♻ ☆ From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.
♻ ☆ Multilingual Vision-Language Models, A Survey
This survey examines multilingual vision-language models that process text and images across languages. We review 33 models and 23 benchmarks, spanning encoder-only and generative architectures, and identify a key tension between language neutrality (consistent cross-lingual representations) and cultural awareness (adaptation to cultural contexts). Current training methods favor neutrality through contrastive learning, while cultural awareness depends on diverse data. Two-thirds of evaluation benchmarks use translation-based approaches prioritizing semantic consistency, though recent work incorporates culturally grounded content. We find discrepancies in cross-lingual capabilities and gaps between training objectives and evaluation goals.
♻ ☆ FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions
Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.
♻ ☆ CADDesigner: Conceptual CAD Model Generation with a General-Purpose Agent
Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present CADDesigner, an LLM-powered agent for conceptual CAD design. The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases can be stored in a structured knowledge base, providing a mechanism for continual knowledge accumulation and future improvement of code generation. Experimental results show that CADDesigner achieves competitive performance and outperforms representative baselines on conceptual CAD model generation tasks.
♻ ☆ LLM Flow Processes for Text-Conditioned Regression
Recent work has demonstrated surprisingly good performance of pre-trained LLMs on regression tasks (for example, time-series prediction), with the ability to incorporate expert prior knowledge and the information contained in textual metadata. However we observe major error cascades even in short sequences < ~100 points; these models are also computationally intensive and difficult to parallelise. Marginal LLM predictions do not suffer this issue and are trivially parallelised, but can predict over-broad densities. To address this, we propose combining these densities with a lightweight (diffusion-based) neural process. We show that this combination leads to better-calibrated predictions overall, outputs locally consistent trajectories, and leads to text-conditioned function space selection in the meta-learner. As part of this work we propose a gradient-free (and non-Monte Carlo) method for sampling from a product-of-experts of a score model and an 'expert' (here the LLM predictive densities). We believe this general method is of independent interest as it is applicable whenever an expert can be convolved with a Gaussian in closed form.
♻ ☆ Where Do Reasoning Models Refuse? ICML 2025
Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.
comment: v1 accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM). 20 pages, 12 figures
♻ ☆ Gyan: An Explainable Neuro-Symbolic Language Model NeurIPS 2026
Transformer based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large scale pre-training, these models do not capture complete compositional context and certainly not, the full human analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory and knowledge-based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency especially in mission critical use cases. Collectively, our results demonstrate that it is possible to create models which are trustable and reliable for mission critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.
comment: also submitted to NeurIPS 2026
♻ ☆ Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
Whether in agentic workflows, social studies, or chat settings, large language models (LLMs) are increasingly being asked to replace humans in choosing which goals to pursue, rather than completing predefined tasks. However, the assumption that LLMs accurately reflect human preferences for goal setting remains largely untested. We assess the validity of LLMs as proxies for human goal selection in a controlled, self-directed learning task borrowed from cognitive science. Across five models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, Qwen3 32B, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Chain-of-thought reasoning and persona steering provide limited improvements, and our conclusions hold across experimental settings. While they await confirmation in applied settings, these findings highlight the uniqueness of human goal selection and caution against its replacement with current models.
♻ ☆ Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench and LLMEval-Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, significantly outperforming the GPT-4o baseline and demonstrating robust cross-lingual generalization. In discriminative tests on HealthBench, our rubrics yield a 7.8% higher win rate than GPT-4o baseline with nearly double score $Δ$, while ablation studies confirm its structural necessity. Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2%. This provides a scalable, cross-lingual foundation for both evaluating and improving medical LLMs. The code is available at https://github.com/AmbeChen/Automated-Rubric-Generation.
♻ ☆ Clustering in pure-attention hardmax transformers and its role in sentiment analysis SC
Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called \textit{leaders}. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures `context' by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.
comment: 23 pages, 11 figures, 1 table. Funded by the European Union (Horizon Europe MSCA project ModConFlex, grant number 101073558). Accompanying code available at: https://github.com/DCN-FAU-AvH/clustering-hardmax-transformers
♻ ☆ CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR IEEE
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
comment: Accepted to IEEE ICASSP 2026
♻ ☆ Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
♻ ☆ Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder IEEE
This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. This joint training approach captures the inherent interdependencies among the tasks, enhancing overall performance on overlapping speech data. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms the previous studies, achieving diarization error rates of 1.37% and 2.29% on Libri2Mix and Libri3Mix evaluation sets, respectively.
comment: Accepted to IEEE ASRU 2025
♻ ☆ MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method
Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is even more difficult. Currently, Machine Learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segmentation task as an anomaly detection of energy-based segmented audio sequences. Our method applies an Isolation Forest on audio representations learned from BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.
♻ ☆ Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of ``foresight'': it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the \textbf{Module-Allocation Level}, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the \textbf{Update-Direction Level}, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose \textbf{EffOPD}, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
♻ ☆ Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization
Prefill attacks are an effective and low-cost jailbreaking method, as they directly insert an acceptance sequence (e.g., "Sure, here is how to...") at the start of an LLM's output and lead the model to continue the response. We make two contributions to this prior work. First, we show that an unsophisticated adversary can improve the well-known prefill attacks by ensembling a small number of prefill variants. Running three easy-to-generate prefills yields a combined attack success rate (ASR) of 22%, 90%, and 99% on Gemma-7B, Llama-3.1-8B, and Qwen3-8B respectively, an up to 38% improvement over the standard "Sure, here's..." prefill and up to 82% over our reproduction of GCG (Zou et al., 2023). Second, we introduce "sockpuppetting", a hybrid attack that optimizes an adversarial suffix placed inside the "assistant" message block of the chat template, rather than within the user prompt. The rolling variant of this attack, RollingSockpuppetGCG, increases prompt-agnostic ASR by up to 64% over our universal GCG baseline on Llama-3.1-8B. Both findings highlight the need for defences against output-prefix injection in open-weight models. Code: https://gitlab.com/asendotsinski/sockpuppetting
comment: 13 pages, 6 figures
♻ ☆ Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
♻ ☆ LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
♻ ☆ ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
Current multimodal large language models (MLLMs) are mainly focused on the understanding and processing of perceptual modalities such as images and videos, while their capability for scientific data understanding remains insufficient. To this end, we propose ChatSR, a novel multimodal large language model tailored for scientific data understanding. ChatSR treats scientific data as a new modality analogous to visual content and, through carefully designed encoders and modality alignment mechanisms, maps scientific data into a representation space that can be processed by large language models, enabling the model to grasp the structural characteristics and underlying regularities of scientific data. Building on this foundation, ChatSR further exploits the rich domain knowledge and strong reasoning abilities of large language models to emulate a knowledgeable human scientist: based on user-specified prior constraints and preferences expressed (such as requirements on periodicity, symmetry, etc.), it automatically generates mathematical formulas that not only accurately fit the observed data but also conform to domain priors, thereby characterizing the latent laws embodied in scientific data and promoting the automation of scientific discovery. Experiments on 13 datasets show that ChatSR achieves state-of-the-art performance on traditional symbolic regression benchmarks. In addition, ChatSR exhibits a promising zero-shot ability to understand and utilize types of prior knowledge that are not present in its training data.
comment: 14 pages,
♻ ☆ Confidence Estimation in Automatic Short Answer Grading with LLMs
Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.
comment: accepted to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
♻ ☆ GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.
comment: 46 pages, 16 figures
♻ ☆ Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the \texttt{partially\_correct\_incomplete} label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
comment: accepted at BEA 2026, the 21st Workshop on Innovative Use of NLP for Building Educational Applications
♻ ☆ Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation
Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards, enabling ACE to resolve long-range legal dependencies that encoder-only models fail to capture. On our constructed benchmark, ACE achieves the best F1 among the evaluated methods at 94.95\% while reducing operational costs by 78\% compared to standalone LLM deployments. Crucially, the entropy-based routing threshold transfers directly to authentic USPTO §112(b) rejections without re-calibration, confirming distributional robustness beyond synthetic settings. To facilitate reproducible research, we release ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, alongside ACE-Real112b, a stress-test corpus of 204 genuine Office Action rejections.
♻ ☆ Improving LLM Final Representations with Inter-Layer Geometry
The standard in LLM-based prediction is to use the final-layer representation as the input to a downstream predictor. However, intermediate layers may encode complementary task-relevant signals. Existing approaches therefore either search for the best layer for each task or apply expensive attention-based mechanisms to learn inter-layer aggregation. In this work, we first show that such complexity is unnecessary: a lightweight Graph Neural Network over a fully connected graph of LLM layers is more efficient and achieves significantly stronger predictive performance than existing approaches. We then introduce the Cayley-Encoder, which further improves both efficiency and predictive performance by replacing the fully connected graph with a Cayley graph over SL(2, Zn). These Cayley graphs provide a mathematically grounded topology that is sparse, regular by construction, and has low diameter. This enables effective communication across layers while constraining the aggregation structure and reducing the risk of GNN overfitting. In an evaluation of Cayley-Encoder across 13 tasks and 9 LLMs, Cayley-Encoder consistently outperforms baselines, achieving improvements of up to 40 percentage points in accuracy, while introducing at most 0.1% additional parameters relative to the LLM size. We further show that Cayley-Encoder is effective in few-shot regimes. Finally, we show that Cayley-Encoder outperforms LoRA fine-tuning while operating on the frozen LLM. We conclude with an explainability analysis showing that multiple layers contribute meaningfully to the final prediction, supporting our hypothesis.
comment: 17 pages, 4 figures. Equal contribution by first two authors
♻ ☆ Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity
Automated fact-checking is a crucial task that supports a responsible information ecosystem. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that the indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative vision-language models with distinct roles for the adaptive use of visual evidence: an Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. We will release all code and datasets at https://github.com/ssu-humane/AMuFC.
comment: preprint, 18 pages
Multimedia 4
☆ Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers
Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another, shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive activations transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
comment: Project page: https://aimagelab.github.io/MAs-DiT/
☆ Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics
As AI-generated synthetic images become increasingly realistic, Vision Transformers (ViTs) have emerged as a cornerstone of modern deepfake detection. However, the prevailing reliance on frozen, pre-trained backbones introduces a subtle yet critical vulnerability. In this work, we present the Surrogate Iterative Adversarial Attack (SIAA), a gray-box attack that exploits knowledge of the detector's ViT backbone alone and operates entirely within the target detector's feature space to craft highly effective adversarial examples. Through our experiments, involving multiple ViT-based detectors and diverse gray-box scenarios, including few-shot learning, complete training misalignment and attack transferability tests, we demonstrate that this vulnerability consistently yields high attack success rates, often approaching white-box performance. By doing so, we reveal that backbone knowledge alone is sufficient to undermine detector reliability, highlighting the urgent need for more resilient defenses in adversarial multimedia forensics.
♻ ☆ V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
comment: Project page: https://genjib.github.io/v2m_zero/
♻ ☆ Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision. Project page: https://cheliu-computation.github.io/omni/
comment: Project page: https://cheliu-computation.github.io/omni/
Computer Vision and Pattern Recognition 243
☆ Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
☆ SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
comment: Project page: https://github.com/OpenSenseNova/SenseNova-U1
☆ EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera SIGGRAPH 2026
Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
comment: 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference
☆ From Web to Pixels: Bringing Agentic Search into Visual Perception
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEyes contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
comment: Project page: https://pixel-searcher.github.io/
☆ CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
comment: Project page: https://yihao-meng.github.io/CausalCine/
☆ AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward ICML2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
comment: ICML2026
☆ Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction ICML 2026
Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet the pervasive photometric ambiguities have strictly bottlenecked existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution upon Gaussian Splatting for the photometric ambiguity-robust surface 3D reconstruction with high performance. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Stemming from these, a photometric disambiguation is first introduced, constraining ill-posed geometry solution for definite surface formation. Then, we propose an ambiguity indication module that unleashes the self-indication potential to identify and further guide correcting underconstrained reconstructions. Extensive experiments demonstrate our superior surface reconstructions compared to existing methods across various challenging scenarios, excelling in broad compatibility. Project: https://fictionarry.github.io/AmbiSuR-Proj/ .
comment: Accepted at ICML 2026. Project page: https://fictionarry.github.io/AmbiSuR-Proj/
☆ Elastic Attention Cores for Scalable Vision Transformers
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
comment: Project repository here: https://github.com/alansong1322/VECA
☆ OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
comment: Project page: https://zghhui.github.io/OmniNFT/
☆ FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation
Continual Panoptic Segmentation (CPS) requires methods that can quickly adapt to new categories over time. The nature of this dense prediction task means that training images may contain a mix of labeled and unlabeled objects. As nothing is known about these unlabeled objects a priori, existing methods often simply group any unlabeled pixel into a single "background" class during training. In effect, during training, they repeatedly tell the model that all the different background categories are the same (even when they aren't). This makes learning to identify different background categories as they are added challenging since these new categories may require using information the model was previously told was unimportant and ignored. Thus, we propose a Future-Targeted Contrastive and Repulsive (FuTCR) framework that addresses this limitation by restructuring representations before new classes are introduced. FuTCR first discovers confident future-like regions by grouping model-predicted masks whose pixels are consistently classified as background but exhibit non-background logits. Next, FuTCR applies pixel-to-region contrast to build coherent prototypes from these unlabeled regions, while simultaneously repelling background features away from known-class prototypes to explicitly reserve representational space for future categories. Experiments across six CPS settings and a range of dataset sizes show FuTCR improves relative new-class panoptic quality over the state-of-the-art by up to 28%, while preserving or improving base-class performance with gains up to 4%.
☆ LychSim: A Controllable and Interactive Simulation Framework for Vision Research CVPR 2026
While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
comment: 3D-LLM/VLA Workshop at CVPR 2026. Project page: https://lychsim.github.io/
☆ 3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark CVPR 2026
Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore, explicit temporal coupling, or complex multi-body constraints seems unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scene. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.
comment: Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables
☆ GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization IEEE
Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
comment: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
☆ AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection CVPR 2026
Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors by combining small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need of labeled examples. We pre-train SOTA self-supervised algorithms in a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with more complex attention-based aggregation used currently in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval based segmentation outperforms fine-tuning when targeting single device images, allowing for near-instant adaptation to difficult samples.
comment: Accepted to the AI4RWC Workshop at CVPR 2026
☆ Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
☆ GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction SIGGRAPH 2026
3D Gaussian Splatting (3DGS) has emerged as a prominent paradigm for 3D reconstruction and novel view synthesis. However, it remains vulnerable to severe artifacts when trained under sparse-view constraints. While recent methods attempt to rectify artifacts in rendered views using image diffusion models, they typically rely on multi-view self-attention to retrieve information from reference images. We observe that this mechanism often fails when the rendered novel views output by 3DGS are heavily corrupted: damaged query features lead to erroneous cross-view retrieval, resulting in inconsistent rendering refinement. To address this, we propose GeoQuery, a geometry-guided diffusion framework that integrates generative priors with explicit geometric cues via a novel Geometry-guided Cross-view Attention (GCA) mechanism. First, by leveraging predicted depth maps and camera poses, we construct a geometry-induced correspondence field to sample reference features, forming a geometry-aligned proxy query that replaces the corrupted rendering features. Furthermore, we design a new cross-view feature aggregation pipeline, in which we restrict the cross-view attention to a local window around each proxy query to effectively retrieve useful features while suppressing spurious matches. GeoQuery can be seamlessly integrated into existing diffusion-based pipelines, enabling robust reconstruction even under extreme view sparsity. Extensive experiments on sparse-view novel view synthesis and rendering artifact removal demonstrate the effectiveness of our approach.
comment: Accept to SIGGRAPH 2026 Conference Track
☆ SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation ICML 2026
Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
comment: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices
☆ Fast Image Super-Resolution via Consistency Rectified Flow ICCV 2025
Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
comment: Accepted by ICCV 2025
☆ Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose \textbf{GAP}, a \textbf{G}ranular \textbf{A}lignment \textbf{P}aradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
☆ VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference ICML2026
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce \textcolor{oursblue}{\textbf{VI}}sual-guided \textcolor{oursblue}{\textbf{P}}rompt evolution (\textcolor{oursblue}{\textbf{\textit{VIP}}}) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, \VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that \VIP: \ding{182} surpasses the top-leading methods by $1.4\% \sim 8.4\%$ average mIoU, \ding{183} generalizes well to diverse challenging domains, and \ding{184} requires marginal inference time and memory overhead. \href{https://github.com/MiSsU-HH/VIP}{Our code is publicly available at GitHub \faGithub}.
comment: Accepted by ICML2026
☆ Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos MICCAI 2026
Learning robust representations of polyp tracklets is key to enabling multiple AI-assisted colonoscopy applications, from polyp characterization to automated reporting and retrieval. Supervised contrastive learning is an effective approach for learning such representations, but it typically relies on correct positive and negative definitions. Collecting these labels requires linking tracklets that depict the same underlying polyp entity throughout the video, which is costly and demands specialized clinical expertise. In this work, we leverage the sequential workflow of colonoscopy procedures to derive self-supervised associations from temporal structure. Since temporally derived associations are not guaranteed to be correct, we introduce a noise-aware contrastive loss to account for noisy associations. We demonstrate the effectiveness of the learned representations across multiple downstream tasks, including polyp retrieval and re-identification, size estimation, and histology classification. Our method outperforms prior self-supervised and supervised baselines, and matches or exceeds recent foundation models across all tasks, using a lightweight encoder trained on only 27 videos. Code is available at https://github.com/lparolari/ntssl.
comment: Accepted to MICCAI 2026
☆ G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
The development of separate-encoder Unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity and so on, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
comment: Code is at: https://github.com/lijunxian111/G2TR
☆ KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC) KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
☆ Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets. To address these challenges, we propose \texttt{I}mages i\texttt{N} \texttt{SE}n\texttt{T}ences (\textit{a.k.a}, INSET), a unified generation model that seamlessly embeds images as native vocabulary within textual instructions. By positioning visual features directly at their corresponding semantic slots, INSET leverages the contextual locality of transformers for precise object binding, effectively treating images as dense, expressive language tokens. Furthermore, we introduce a scalable data engine that synthesizes 15M high-quality interleaved samples from standard image and video datasets, utilizing VLMs and LLMs to construct rich, long-horizon sequences. Evaluation results on InterleaveBench demonstrate that INSET significantly outperforms state-of-the-art methods in multi-image consistency and text alignment, with performance gaps widening as input complexity increases. Beyond standard generation, our approach inherently extends to multimodal image editing, integrating visual content as part of the instruction to facilitate highly expressive and creative visual manipulations.
☆ From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.
☆ EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
comment: Extended version of SMC 2025 paper arXiv:2503.12419. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose
☆ Large-Small Model Collaboration for Farmland Semantic Change Detection
Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.
☆ Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose \textbf{\emph{visual-to-visual} (V2V)} generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce \textbf{V2V-Zero}, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce \textbf{Simple-V2V Bench}, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0\% of conditioning-token attention mass on visual-page hidden states.
comment: Project Page: https://yaofang-liu.github.io/V2V_Web
☆ CAD-feature enhanced machine learning for manufacturing effort estimation on sheet metal bending parts
Graph-based machine learning has emerged as a promising approach for manufacturability analysis by learning directly from CAD models represented as Boundary Representations (B-reps), exploiting both surface geometry and topological connectivity. However, purely geometric representations often lack the process-specific semantics required for accurate manufacturability prediction: many manufacturing factors, such as surface roles or bend intent, are not explicitly encoded in shape alone and are difficult for data-driven models to infer reliably. We propose a hybrid approach that addresses this challenge by enriching B-rep attributed adjacency graphs with manufacturing features recognized through a rule-based module. Applied to sheet metal bending, recognized features, such as bend characteristics, flange lengths, and surface roles are integrated as node attributes, concentrating the learning signal on process-relevant geometric patterns. Experiments on both a large-scale synthetic manufacturability benchmark and a real-world industrial dataset with measured bending times, one of the first such validations on genuine production data, demonstrate that combining domain knowledge with graph-based learning improves prediction accuracy across both tasks. The results demonstrate that hybrid modeling offers a feasible and effective path toward deployable tools for manufacturability assessment and effort estimation in industrial CAD environments.
☆ From Image Hashing to Scene Change Detection ICPR 2026
Image hashing provides compact representations for efficient storage and retrieval but is inherently limited to global comparison and cannot reason about where changes occur. This limitation prevents hashing from being directly applicable to scene change detection, where spatial localization is essential. In this work, we revisit hashing from a scene change detection perspective and propose HashSCD, a patch-wise hashing framework that enables both efficient global change detection and localized change identification. HashSCD encodes spatially aligned patches into compact hash codes and aggregates them through an XOR-like operation, allowing change detection and localization to be performed directly in the Hamming space without repeated inference on previous images. The model is trained in an unsupervised manner using contrastive learning at both patch and global levels. Experiments demonstrate that HashSCD achieves competitive performance compared to state-of-the-art unsupervised hashing and scene change detection methods, while significantly reducing computational cost and storage requirements.
comment: 18 pages; accepted to ICPR 2026
☆ H3D-MarNet: Wavelet-Guided Dual-Path Learning for Metal Artifact Suppression and CT Modality Transformation for Radiotherapy Workflows
Metal artifacts in computed tomography (CT) severely degrade image quality, compromising diagnostic accuracy and radiotherapy planning, especially in cancer patients with high-density implants. We propose H3D-MarNet, a two-stage framework for artifact-aware CT domain transformation from kilo-voltage CT (kVCT) to mega-voltage CT (MVCT). In the first stage, a wavelet-based preprocessing module suppresses metal-induced artifacts through frequency-aware denoising while preserving anatomical structures. In second stage, Domain-TransNet performs kVCT-to-MVCT domain transformation using a hybrid volumetric learning architecture. Domain-TransNet integrates a CNN-based encoder to capture fine-grained local anatomical details and a transformer-based encoder to model long-range volumetric dependencies. The complementary representations are fused through an attention-based feature fusion mechanism to ensure spatial and contextual coherence across slices. A multi-stage, attention-guided decoder, supported by deep supervision, progressively reconstructs artifact-suppressed MVCT volumes. Extensive experiments demonstrate that H3D-MarNet achieves 28.14 dB PSNR and 0.717 SSIM on artifact-affected slices from full dataset, indicating effective metal artifact suppression and anatomical preservation, highlighting its potential for reliable CT modality transformation in clinical radiotherapy workflows.
comment: Accepted for publication at the 28th International Conference on Pattern Recognition, Lyon, France August, 17-22, 2026
☆ UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
☆ TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion AAMAS 2026
Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
comment: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
☆ Learning Ego-Centric BEV Representations from a Perspective-Privileged View: Cross-View Supervision for Online HD Map Construction
Bird's-eye-view (BEV) representations derived from multi-camera input have become a central interface for online high-definition (HD) map construction. However, most approaches rely solely on ego-centric supervision, requiring large-scale scene structure to be inferred from incomplete observations, occlusions, and diminishing information density at long range, where perspective effects and spatial sparsity hinder consistent structural reasoning. We introduce Cross-View Supervision (CVS), a representation learning paradigm that transfers geometric and topological priors from an ego-aligned overhead perspective into camera-based BEV encoders. Rather than adding auxiliary semantic losses, CVS aligns representations in a shared BEV feature space and distills globally consistent structural knowledge from a perspective-privileged teacher into the ego-centric backbone. This supervision enhances structural coherence without modifying the inference architecture or requiring overhead input at test time. Experiments on nuScenes using ego-aligned aerial imagery from the AID4AD cross-view extension demonstrate consistent improvements over StreamMapNet while maintaining identical camera-only inference. CVS yields +3.9\,mAP in the standard $60\times30\,\mathrm{m}$ region and +9.9\,mAP in the extended $100\times50\,\mathrm{m}$ setting, corresponding to a 44\% relative gain at long range. These results highlight perspective-privileged structural supervision as a promising training principle for improving BEV representation learning in HD map construction.
☆ Enhancing Domain Generalization in 3D Human Pose Estimation through Controllable Generative Augmentation
Pedestrian motion, due to its causal nature, is strongly influenced by domain gaps arising from discrepancies between training and testing data distributions. Focusing on 3D human pose estimation, this work presents a controllable human pose generation framework that synthesizes diverse video data by systematically varying poses, backgrounds, and camera viewpoints. This generative augmentation enriches training datasets, enhances model generalization, and alleviates the limitations of existing methods in handling domain discrepancies. By leveraging both indoor/real-world and outdoor/virtual datasets, we perform cross-domain data fusion and controllable video generation to construct enriched training data, tailored to realistic deployment settings. Extensive experiments show that the augmented datasets significantly improve model performance on unseen scenarios and datasets, validating the effectiveness of the proposed approach.
☆ SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
Recent advancements in video-audio joint generation have achieved remarkable success in semantic correspondence. However, achieving precise temporal synchronization, which requires fine-grained alignment between audio events and their visual triggers, remains a challenging problem. The post-training method for joint generation is largely dominated by Supervised Fine-Tuning, but the commonly used Mean Squared Error loss provides insufficient penalties for subtle temporal misalignments. Direct Preference Optimization offers an alternative by introducing explicit misaligned counterparts to better improve temporal sensitivity. In this paper we propose a post-training framework SyncDPO, leveraging DPO to improve the temporal sensitivity of V-A joint generation. Conventional DPO pipelines typically depend on costly sampling-and-ranking procedures to construct preference pairs, resulting in substantial computational cost. To improve efficiency, we introduce a suite of on-the-fly rule-based negative construction strategies that distort temporal structures without incurring additional annotation or sampling. We demonstrate that the temporal alignment capability can be effectively reinforced by providing explicit negative supervision through temporally distorted V-A pairs. Accordingly, we implement a curriculum learning strategy that progressively increases the difficulty of negative samples, transitioning from coarse misalignment to subtle inconsistencies. Extensive objective and subjective experiments across four diverse benchmarks, ranging from ambient sound videos to human speech videos, demonstrate that SyncDPO significantly outperforms other methods in improving model's temporal alignment capability. It also demonstrates superior generalization on out-of-distribution benchmark by capturing intrinsic motion-sound dynamics. Demo and code is available in https://syncdpo.github.io/syncdpo/.
comment: Preprint. Under review
☆ UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.
☆ From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation ICML 2026
Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
comment: ICML 2026
☆ Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
comment: 17 pages, 6 figures
☆ Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.
☆ PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization
In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To mitigate this research gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8 percent and significantly outperforming random baselines.
☆ EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion MICCAI 2026
Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, \textbf{EchoTracker2}, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by $6.5\%$ and reduces median trajectory error by $12.2\%$ relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are $2.0\%$ and $5.3\%$, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-rest reproducibility. Source code will be available at: https://github.com/riponazad/ptecho.
comment: Early accepted (top 9%) to MICCAI 2026
☆ Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models CVPR 2026
Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD-GenX/Uni-AdGen.
comment: 22 pages, 19 figures, CVPR 2026
☆ MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation ICPR 2026
Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.
comment: Accepted at ICPR 2026
☆ Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
comment: 40 pages, 23 figures
☆ MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process.MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process.Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.
comment: Project page: https://orange-3dv-team.github.io/MoCam
☆ When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.
☆ World Action Models: The Next Frontier in Embodied AI
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
☆ UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
☆ The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzles datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.
☆ BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at https://huggingface.co/datasets/ramblr/BARISTA.
☆ PairDropGS: Paired Dropout-Induced Consistency Regularization for Sparse-View Gaussian Splatting
Dropout-based sparse-view 3D Gaussian Splatting (3DGS) methods alleviate overfitting by randomly suppressing Gaussian primitives during training. Existing methods mainly focus on designing increasingly sophisticated dropout strategies, while they overlook the resulting inconsistencies among different dropped Gaussian subsets. This oversight often leads to unstable reconstruction and suboptimal Gaussian representation learning.In this paper, we revisit dropout-based sparse-view 3DGS from a consistency regularization perspective and propose PairDropGS, a Paired Dropout-induced Consistency Regularization framework for sparse-view Gaussian splatting. Specifically, PairDropGS first constructs a pair of the dropped Gaussian subsets from a shared Gaussian field and designs a low-frequency consistency regularization to constrain their low-frequency rendered structures. This design encourages the shared Gaussian field to preserve stable scene layout and coarse geometry under different random dropouts, while avoiding excessive constraints on ambiguous high-frequency details. Moreover, we introduce a progressive consistency scheduling strategy to gradually strengthen the consistency regularization during training for stability and robustness of reconstruction. Extensive experiments on widely-used sparse-view benchmarks demonstrate that PairDropGS achieves superior training stability, significantly outperforms existing dropout-based 3DGS methods in reconstruction quality, while exhibiting the simplicity and plug-and-play nature for improving dropout-based optimization.
comment: 11 pages,8 figures
☆ Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection ICIP 2026
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
comment: Accepted to ICIP 2026
☆ TAR: Text Semantic Assisted Cross-modal Image Registration Framework for Optical and SAR Images
Existing deep learning-based methods can capture shared features from optical and synthetic aperture radar (SAR) images for spatial alignment. However, optical-SAR registration remains challenging under large geometric deformations, because the model needs to simultaneously handle cross-modal appearance discrepancies and complex spatial transformations. To address this issue, this paper proposes a text semantic-assisted cross-modal image registration framework, named TAR, for optical and SAR images. TAR exploits text semantic priors from remote sensing scenes and land-cover categories to alleviate the modality gap and enhance cross-modal feature learning. TAR consists of three components: a multi-scale visual feature learning (MSFL) module, a text-assisted feature enhancement (TAFE) module, and a coarse-to-fine dense matching (CFDM) module. MSFL extracts multi-scale visual features from optical and SAR images. TAFE constructs text descriptors related to remote sensing scenes and land-cover objects, and uses a frozen RemoteCLIP text encoder to extract text features. These text features are introduced through visual-text interaction to enhance high-level visual features for more reliable coarse matching. CFDM then establishes coarse correspondences based on the enhanced high-level features and refines the matched locations using low-level features. Experimental results on cross-modal remote sensing images demonstrate the effectiveness of TAR, which achieves stronger matching performance than several state-of-the-art methods and yields significant gains under large geometric deformations.
☆ OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
☆ Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
☆ Resilient Vision-Tabular Multimodal Learning under Modality Missingness
Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
☆ 4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation
Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
☆ Spectral Vision Transformer for Efficient Tokenization with Limited Data
We propose a novel spectral vision transformer architecture for efficient tokenization in limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show equitable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: \verb+github.com/agr78/spectralViT+.
☆ What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization
Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification-oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what-where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What-Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed-forward modules via a multi-stream, slot-based architecture; (2) it reuses both the final-layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single-label classification-based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT-based methods on zero-shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.
☆ FAME: Feature Activation Map Explanation on Image Classification and Face Recognition CVPR
Deep Learning has revolutionized machine learning, reaching unprecedented levels of accuracy, but at the cost of reduced interpretability. Especially in image processing systems, deep networks transform local pixel information into more global concepts in a highly obscured manner. Explainable AI methods for image processing try to shed light on this issue by highlighting the regions of the image that are important for the prediction task. Among these, Class Activation Mapping (CAM) and its gradient-based variants compute attributions based on the feature map and upscale them to the image resolution, assuming that feature map locations are influenced only by underlying regions. Perturbation-based methods, such as CorrRISE, on the other hand, try to provide pixel-level attributions by perturbing the input with fixed patches and checking how the output of the network changes. In this work, we propose Feature Activation Map Explanation (FAME), which combines both worlds by using network gradients to compute changes to the input image, manipulating it in a gradient-driven way rather than using fixed patches. We apply this technique on two common tasks, image classification and face recognition, and show that CAM's above-mentioned assumption does not hold for deeper networks. We qualitatively and quantitively show that FAME produces attribution maps that are competitive state-of-the-art systems. Our code is available: {\footnotesize https://github.com/AIML-IfI/fame.}
comment: Accepted for CVPR Workshop 2026
☆ L2P: Unlocking Latent Potential for Pixel Generation
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
comment: project page: https://nju-pcalab.github.io/projects/L2P/
☆ Robust Promptable Video Object Segmentation CVPR 2026
The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety-critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real-world evaluation datasets of 351 video clips and more than 2,500 object masks under real-world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two-fold: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun-l.github.io/RobustPVOS_project_page/.
comment: Accepted to CVPR 2026
☆ EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization
Text-guided inpainting has made image forgery increasingly realistic, challenging both SID and IFL. However, existing methods often struggle to point out suspicious signals across domains. To address this problem, we propose EDGER, a patch-based, dual-branch framework that localizes manipulated regions in arbitrary resolution images without sacrificing native resolution. The first branch, Edge-Guided Segmentation, introduces a Frequency-based Edge Detector to emphasize high-frequency inconsistencies at manipulation boundaries, and fine-tunes a SegFormer to fuse RGB and edge features for pixel-level masks. Since edge evidence is most informative only when patches contain both authentic and manipulated pixels, we complement Edge-Guided Segmentation with a Synthetic Heatmapping branch, a classification-based localizer that fine-tunes a CLIP-ViT image encoder with LoRA to flag fully synthetic patches. Together, Synthetic Heatmapping provides coarse, patch-level synthetic priors, while Edge-Guided Segmentation sharpens boundaries within partially manipulated patches, yielding comprehensive localization. Evaluated in the MediaEval 2025, SynthIM challenge, Manipulated Region Localization Task's setting, our approach scales to multi-megapixel imagery and exhibits strong cross-domain generalization. Extensive ablations highlight the complementary roles of frequency-based edge cues and patch-level synthetic priors in driving accurate, resolution-agnostic localization.
comment: Accepted for publication in the Proceedings of the 14th International Symposium on Information and Communication Technology (SOICT 2025)
☆ A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in training sessions in one episode and with ten episodes.
comment: Published by Machine Learning and Knowledge Extraction Journal
☆ Optimizing 4D Wires for Sparse 3D Abstraction
We present a unified framework for 3D geometric abstraction using a single continuous 4D wire, parameterized as a B-spline with spatial coordinates and variable width $(x,y,z,w)$. Existing approaches typically represent shapes as collections of many independent curve segments, which often leads to fragmented structures and limited physical realizability. In contrast, we show that a single continuous spline is sufficiently expressive to capture complex volumetric forms while enforcing global topological coherence. By imposing continuity, our method transforms 3D sketching from a local density-accumulation process into a global routing problem, providing a strong inductive bias toward cleaner aesthetics and improved structural coherence. To enable gradient-based optimization, we introduce a differentiable rendering pipeline that efficiently rasterizes variable-width curves with bounded projection error. This formulation supports robust optimization using modern guidance signals such as Score Distillation Sampling (SDS) or CLIP. We demonstrate applications including image-to-3D abstraction, multi-view wire art generation, and differentiable stylized surface filling. Experiments show that our unified representation produces structures with higher semantic fidelity and improved structural coherence compared to approaches based on collections of discrete curves.
☆ H2G: Hierarchy-Aware Hyperbolic Grouping for 3D Scenes
Hierarchical 3D grouping aims to recover scene groups across multiple granularities, from fine object parts to complete objects, without relying on semantic labels or a fixed vocabulary. The main challenge is to transform 2D foundation-model cues into coherent hierarchy supervision and embed that hierarchy in a 3D representation. We propose H2G, a hyperbolic affinity field for hierarchical 3D grouping. Our method derives semantically organized tree supervision by interpreting foundation-model affinities through Dasgupta's objective for similarity-based hierarchical clustering. This supervision is distilled into a single Lorentz hyperbolic feature field, whose geometry is well suited for tree-like branching structures. A hierarchy-aware objective aligns the field with fine-level assignments, coarse object structure, compact feature clusters, and LCA (Lowest Common Ancestor) ordering. This formulation represents multiple grouping levels in one feature space, enabling semantic hierarchical grouping grounded in 2D foundation-model knowledge.
☆ What Does It Mean for a Medical AI System to Be Right?
This paper examines what it means for a medical AI system to be right by grounding the question in a specific clinical context: the automatic classification of plasma cells in digitized bone marrow smears for the diagnosis of multiple myeloma. Drawing on philosophy of science and research ethics, the paper argues that correctness in medical AI is not a singular property reducible to benchmark performance, but a multi-dimensional concept involving the availability of expertly labeled medical datasets, the explainability and interpretability of model outputs, the clinical meaningfulness of evaluation metrics, and the distribution of accountability in human-AI workflows. As such, the paper develops this argument through four interrelated themes: the instability of ground truth labels, the opacity of overconfident AI, the inadequacy of standard clinical metrics, and the risk of automation bias in time-pressured clinical settings.
comment: Part of a PhD ethics course
☆ Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Vision Large Language Models (VLLMs) have achieved remarkable success in modern text-rich visual understanding. However, their perceptual robustness in the face of the continuous morphological evolution of historical writing systems remains largely unexplored. Existing ancient text datasets typically focus on isolated historical periods, failing to capture the systematic visual distribution shifts spanning thousands of years. To bridge this gap and empower Digital Humanities, we introduce Chronicles-OCR, the first comprehensive benchmark specifically designed to evaluate the cross-temporal visual perception capabilities of VLLMs across the complete evolutionary trajectory of Chinese characters, known as the Seven Chinese Scripts. Curated in collaboration with top-tier institutional domain experts, the dataset comprises 2,800 strictly balanced images encompassing highly diverse physical media, ranging from tortoise shells to paper-based calligraphy. To accommodate the drastic morphological and topological variations across different historical stages, we propose a novel Stage-Adaptive Annotation Paradigm. Based on this, Chronicles-OCR formulates four rigorous quantitative tasks: cross-period character spotting, fine-grained archaic character recognition via visual referring, ancient text parsing, and script classification. By isolating visual perception from semantic reasoning, Chronicles-OCR provides an authoritative platform to expose the limitations of current VLLMs, paving the way for robust, evolution-aware historical text perception. Chronicles-OCR is publicly available at https://github.com/VirtualLUOUCAS/Chronicles-OCR.
☆ Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models ICPR 2026
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. https://github.com/aqeeelmirza/clipsum
comment: Accepted to ICPR 2026
☆ Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
☆ Interactive State Space Model with Cross-Modal Local Scanning for Depth Super-Resolution ISCA
Guided depth super-resolution (GDSR) reconstructs HR depth maps from LR inputs with HR RGB guidance. Existing methods either model each modality independently or rely on computationally expensive attention mechanisms with quadratic complexity, hindering the establishment of efficient and semantically interactive joint representations. In this paper, we observe that feature maps from different modalities exhibit semantic-level correlations during feature extraction. This motivates us to develop a more flexible approach enabling dense, semantically-aware deep interactions between modalities. To this end, we propose a novel GDSR framework centered around the Interactive State Space Model. Specifically, we design a cross-modal local scanning mechanism that enables fine-grained semantic interactions between RGB and depth features. Leveraging the Mamba architecture, our framework achieves global modeling with linear complexity. Furthermore, a cross-modal matching transform module is introduced to enhance interactive modeling quality by utilizing representative features from both modalities. Extensive experiments demonstrate competitive performance against state-of-the-art methods.
comment: ISCAS2026
☆ Learn to Think: Improving Multimodal Reasoning through Vision-Aware Self-Improvement Training ICML 2026
Post-training with explicit reasoning traces is common to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, acquiring high-quality reasoning traces is often costly and time-consuming. Hence, the self-improvement paradigm has emerged, enabling MLLMs to self-generate reasoning traces for training without external supervision. Despite its effectiveness, we reveal two shortcomings in the self-improvement training of MLLMs: 1) data imbalance, where simple samples are over-trained, but the challenging yet crucial samples are under-trained; 2) language prior bias, where MLLMs overly rely on linguistic priors while neglecting the visual cues. To this end, we propose VISTA, a vision-aware self-improvement training framework for enhancing the multimodal reasoning of MLLMs. Specifically, VISTA first introduces a prefix resampling strategy to reuse the partial correct reasoning traces for efficient data collection, and then designs a vision-aware attention score to quantify the model's focus on visual information. Extensive experiments show that VISTA can be applied to various post-training scenarios, i.e., supervised fine-tuning and preference learning, and effectively enhances the multimodal reasoning performance across various MLLMs and tasks, e.g., bringing up to +13.66% average performance gains for Qwen2.5-VL-3B-Instruct.
comment: Accepted by ICML 2026
☆ RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation CVPR2026
While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
comment: CVPR2026
☆ Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization
Differentiable vector graphics have enabled powerful gradient-based optimization of vector primitives directly from raster images. However, existing frameworks formulate this as a flat optimization problem, forcing hundreds to thousands of randomly initialized curves to blindly compete for pixel-level error reduction. This disordered optimization leads to topology collapse, where macroscopic structures are distorted by internal high-frequency noise, resulting in a redundant and uneditable "polygon soup" that limits practical editability. To address this limitation, we propose Vector Scaffolding, a novel hierarchical optimization framework that shifts from flat pixel-matching to structured topological construction tailored for vector graphics. By identifying a key cause of topology collapse as the mathematical imbalance between area and boundary gradients, we introduce Interior Gradient Aggregation to stabilize the learning dynamics of multi-scale curve mixtures. Upon this stabilized landscape, we employ Progressive Stratification and Rapid Inflation Scheduling to progressively densify vector primitives with extremely high learning rates ($\times 50$). Experiments demonstrate that our approach accelerates optimization by $2.5\times$ while simultaneously improving PSNR by up to 1.4 dB over the previous state of the art.
comment: 22 pages, 12 figures
☆ Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning ICML2026
The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local-to-global'' representation. Furthermore, we introduce Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
comment: accepted by ICML2026
☆ Mobile Traffic Camera Calibration from Road Geometry for UAV-Based Traffic Surveillance
Unmanned aerial vehicles (UAVs) can provide flexible traffic surveillance where fixed roadside cameras are unavailable, costly, or impractical. However, raw UAV video is difficult to use for traffic analytics because vehicle motion is observed in perspective image coordinates rather than in a stable metric road coordinate system. This paper presents a lightweight pipeline for converting monocular oblique UAV traffic video into a local metric bird's-eye-view (BEV) representation. Visible road geometry, including lane markings, road borders, and crosswalks, is used to estimate a road-plane homography from image coordinates to metric ground-plane coordinates. Vehicle observations from dataset annotations or detectors are then projected to BEV using estimated ground contact points. The resulting trajectories support estimation of vehicle direction, speed, heading, and dynamic 3D cuboids on the road plane. We evaluate the pipeline on UAVDT using ground-truth annotations to isolate calibration and geometric reconstruction from detector and tracker errors. For sequence M1401, 40 sampled frames from img000001-img000196 produce 632 metric cuboid instances across 23 tracks. Results show that road-geometry calibration can transform monocular UAV footage into interpretable traffic-camera-style analytics, including BEV tracks and synchronized 3D cuboid visualizations. They also reveal key limitations: far-field vehicles are sensitive to homography errors, manual validation is currently more reliable than fully automatic calibration, and the single-plane assumption limits performance in non-planar or ambiguous road regions. The proposed pipeline provides a practical foundation for deployable UAV traffic cameras and future real-time traffic digital-twin systems.
☆ Few-Shot Synthetic Data Generation with Diffusion Models for Downstream Vision Tasks CVPR 2026
Class imbalance is a persistent challenge in visual recognition, particularly in safety-critical domains where collecting positive examples is expensive and rare events are inherently underrepresented. We propose a lightweight synthetic data augmentation pipeline that fine-tunes a LoRA adapter on as few as 20-50 real images of a rare class and uses a pretrained diffusion model to generate synthetic samples for training. We systematically vary the synthetic-to-real ratio and evaluate the approach across two structurally different domains: chest X-ray pathology classification (NIH ChestX-ray14) and industrial surface crack detection (Magnetic Tile Defect dataset). All evaluations are performed on held-out sets of real images only. Across both domains, synthetic augmentation consistently improves rare-class recall and F1 compared to training with real data alone. Performance improves with moderate synthetic augmentation and shows diminishing returns as the synthetic ratio increases. These results suggest that LoRA-adapted diffusion models provide a simple and scalable mechanism for augmenting rare classes, enabling effective learning in data-scarce scenarios across heterogeneous visual domains.
comment: 5 pages, 3 figures, 1 table. Accepted at SynData4CV Workshop @ CVPR 2026
☆ Learning Subspace-Preserving Sparse Attention Graphs from Heterogeneous Multiview Data
The high-dimensional features extracted from large-scale unlabeled data via various pretrained models with diverse architectures are referred to as heterogeneous multiview data. Most existing unsupervised transfer learning methods fail to faithfully recover intrinsic subspace structures when exploiting complementary information across multiple views. Therefore, a fundamental challenge involves constructing sparse similarity graphs that preserve these underlying subspace structures for achieving semantic alignment across heterogeneous views. In this paper, we propose a sparse attention graph learning (SAGL) method that learns subspace-preserving sparse attention graphs from heterogeneous multiview data. Specifically, we introduce a bilinear attention factorization scheme to capture asymmetric similarities among the high-dimensional features, which breaks the symmetry bottleneck that is inherent in the traditional representation learning techniques. A dynamic sparsity gating mechanism then predicts a feature-specific compression factor for adaptively controlling the topological contributions of neighbors. Furthermore, we employ a structured sparse projection via $α$-entmax to generate subspace-preserving sparse attention graphs for individual views. SAGL leverages these view-specific graphs to conduct sparse information aggregation, yielding discriminative representations for multiview learning tasks. In addition, we provide a rigorous theoretical analysis that bridges differentiable sparse attention and probability simplex constraints. Extensive experiments conducted on multiple benchmark datasets demonstrate that SAGL consistently outperforms the state-of-the-art unsupervised transfer learning approaches.
comment: 18 pages
☆ $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance the trade-off between trajectory adherence and visual quality and the heuristic guidance-strength tuning lacks robustness. We propose \textbf{$h$-control}, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop \emph{block-conditional pseudo-Gibbs refinement} on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, \textbf{$h$-control} attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.
☆ FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
☆ When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI MICCAI 2026
Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.
comment: MICCAI 2026 accepted paper (no rebuttal)
☆ Very Efficient Listwise Multimodal Reranking for Long Documents ICML 2026
Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
comment: To appear in ICML 2026
☆ GATA2Floor: Graph attention for floor counting in street-view facades IEEE
Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
comment: Accepted at IEEE ICIP 2026; 6 pages, 5 figures, 3 tables
☆ UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
☆ Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation
Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods -- concatenation, confidence-aware gating, sparse supervision, graph-based extraction -- combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $Δ$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0--50, 0--70, and 0--80m, while attaining the lowest single-frame latency (26.8ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.
comment: 16 pages, 3 figures, 9 tables
☆ REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View IEEE
A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.
comment: IEEE Intelligent Transportation Systems Conference (ITSC) 2025
☆ RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
Recent diffusion-based approaches have made substantial progress in image layer decomposition. However, accurately decomposing complex natural images remains challenging due to difficulties in occlusion completion, robust layer disentanglement, and precise foreground boundaries. Moreover, the scarcity of high-quality multi-layer natural image datasets limits advancement. To address these challenges, we propose RevealLayer, a diffusion-based framework that decomposes an RGB image into multiple RGBA layers, enabling precise layer separation and reliable recovery of occluded content in natural images. RevealLayer incorporates three key components: (1) a Region-Aware Attention module to disentangle hidden and visible layers; (2) an Occlusion-Guided Adapter to leverage contextual information to enhance overlapping regions; and (3) a composite loss to enforce sharp alpha boundaries and suppress residual artifacts. To support training and evaluation, we introduce RevealLayer-100K, a high-quality multi-layer natural image constructed through a collaboration between automated algorithms and human annotation, and further establish RevealLayerBench for benchmarking layer decomposition in general natural scenes. Extensive experiments demonstrate that RevealLayer consistently outperforms existing approaches in layer decomposition.
☆ See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
Vision-Language-Action (VLA) models have shown remarkable promise in robotics manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression using pruning inevitably discards critical geometric details like contact points, leading to severe performance degradation. This forces a compromise, limiting the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as a geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS preserves essential spatial information while achieving drastic compression (with fewer than 10% original visual tokens). Experiments on both LIBERO benchmark and a real robotic platform demonstrate that validating the lowest feasible visual token count reported to date, GridS achieves a 76% reduction in FLOPs with no degradation in the success rate. The code is available at https://github.com/Fediory/Grid-Sampler.
☆ Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.
☆ Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning
Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github. com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
☆ OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
comment: 22pages, 9 figures. Code available at https://github.com/minseokii/OTT-Vid
☆ SB-BEVFusion: Enhancing the Robustness against Sensor Malfunction and Corruptions ICIP 2026
Multimodal sensor fusion has demonstrated remarkable performance improvements over unimodal approaches in 3D object detection for autonomous vehicles. Typically, existing methods transform multimodal data from independent sensors, such as camera and LiDAR, into a unified bird's-eye view (BEV) representation for fusion. Although effective in ideal conditions, this strategy suffers from substantial performance deterioration when camera or LiDAR data are missing, corrupted, or noisy. To address this vulnerability, we develop a framework-agnostic fusion module for camera and LiDAR data that allows for handling cases when one of the two modalities is missing or corrupted. To demonstrate the effectiveness of our module, we instantiate it in BEVFusion [1], a well-established framework to combine camera and LiDAR data for 3D object detection. By means of quantitative experiments on the MultiCorrupt dataset, we demonstrate that our module achieves favorable performance improvements under scenarios of missing and corrupted modalities, substantially outperforming existing unified representation approaches across a wide range of sensor deterioration scenarios and reaching state-of-the-art performance in scenarios of corrupted modality due to extreme weather conditions and sensor failure.
comment: Accepted at ICIP 2026
☆ Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision IEEE
Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures -ViLT, LLaVA, InstructBLIP, and Qwen-VL- and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
comment: 10 pages, 6 figures, submitted to IEEE T-ITS
☆ Revisiting Shadow Detection from a Vision-Language Perspective
Shadow detection is commonly formulated as a vision-driven dense prediction problem, where models rely primarily on pixel-wise visual supervision to distinguish shadows from non-shadow regions. However, this formulation can become unreliable in visually ambiguous cases, where similar dark regions may correspond either to cast shadows or to intrinsically dark surfaces, making visual evidence alone insufficient for establishing a stable decision rule. In this work, we revisit shadow detection from a vision--language perspective and argue that robust prediction benefits from an explicit semantic reference beyond visual cues alone. We propose SVL, a Shadow Vision--Language framework that uses language as an explicit semantic reference to disambiguate shadows from visually similar dark regions. SVL aligns the global image representation with shadow-related text embeddings through a scene-level shadow ratio regression objective, thereby providing image-level guidance on the overall extent of shadows. To transfer this global guidance to dense inference, SVL introduces a global-to-local coupling mechanism that enforces consistency between image-level guidance and patch-level predictions. In parallel, SVL applies local patch-level constraints with text embeddings to improve fine-grained discrimination under challenging appearance conditions. Built on a frozen DINOv3 image encoder, the framework learns only lightweight projection and decoding modules, yielding a parameter-efficient design with less than $1\%$ trainable parameters. Extensive experiments on multiple shadow detection benchmarks, including dedicated hard-case evaluations, suggest strong overall performance and improved robustness under visually ambiguous conditions.
☆ M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection
The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
comment: 10 pages, 3 figures
☆ DiffSegLung: Diffusion Radiomic Distillation for Unsupervised Lung Pathology Segmentation
Unsupervised segmentation of pulmonary pathologies in CT remains an open challenge due to the absence of annotated multi pathology cohorts and the failure of existing diffusion-based methods to exploit the quantitative Hounsfield Unit (HU) signal that physically distinguishes tissue classes. To address this, we propose DiffSegLung,a framework that introduces Diffusion Radiomic Distillation, in which handcrafted radiomic descriptors serve as a physics grounded teacher to shape the bottleneck of a 3D diffusion U-Net via a contrastive objective, transferring pathology discriminative structure into the learned representation without any annotations. At inference, the teacher is discarded and multitimestep bottleneck features are clustered by a Gaussian Mixture Model with HU-guided label assignment, followed by Sobel Diffusion Fusion for boundary refinement. Evaluated on 190 expert annotated axial slices drawn from four heterogeneous CT cohorts, Diff-SegLung improves segmentation across all four pathology classes over unsupervised baselines and improves generation fidelity over prior CT diffusion models.
☆ Focusable Monocular Depth Estimation
Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
☆ One-Step Generative Modeling via Wasserstein Gradient Flows
Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
comment: 38 pages, 14 figures
☆ DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
comment: 19 pages, 7 figures
☆ BronchoLumen: Analysis of recent YOLO-based architectures for real-time bronchial orifice detection in video bronchoscopy
Bronchoscopy is routinely conducted in pulmonary clinics and intensive care units, but navigating the complex branching of the respiratory tract remains challenging. This paper introduces BronchoLumen, a real-time YOLO-based system for detecting bronchial orifices in video bronchoscopy, aiming to assist navigation and CAD systems. The paper investigates if bronchial orifices can be robustly detected across image domains using state-of-the-art object detection and a limited set of public image data. The study includes the description and comparison of YOLOv8, a widely adopted architecture, and YOLOv12, a more recent architecture integrating attention-based modules to improve spatial reasoning. Both models are trained and tested solely on publicly available datasets comprising different image domains. A comparison of both models is conducted based on the common metrics mAP@0.5 and mAP@0.5:0.9 with the latter emphasizing localization accuracy. For YOLOv8 we obtained a mAP@0.5 of 0.91 on an in-domain and 0.68 on a cross-domain test set. YOLOv12 achieved 0.84 and 0.68 respectively with slightly better localization accuracy with mAP@0.5:0.9 of 0.48 and 0.26 compared to YOLOv8 with 0.45 and 0.25. Challenges like motion blur and low contrast occasionally entailed uncertainties but the system demonstrated overall robustness in most scenarios. BronchoLumen is an open-weight, YOLO-based solution for bronchial orifice detection offering high accuracy and efficiency across multiple image domains. While the more recent YOLOv12 achieves better localization accuracy, we observed a slightly worse precision. The models have been made publicly available to foster further research in bronchoscopy navigation.
comment: 10 pages, 4 figures, IPCAI 2026
☆ WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views ICML2026
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.
comment: Accepted as a regular paper at ICML2026
☆ Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66\% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
☆ CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
comment: 27 pages, 10 figures
☆ EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
☆ Unlocking Compositional Generalization in Continual Few-Shot Learning
Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.
comment: 10 pages
☆ CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.
☆ ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation
We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-emporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.
comment: Project page: https://inwoohwang.me/ScaleMoGen
☆ WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting CVPR26
Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.
comment: Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/
☆ Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
☆ DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers
Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.
comment: Preprint. Under review
☆ ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
comment: 14 pages, 5 figures, 2 tables. Code, data, and artifacts: https://github.com/shivamk3r/shape-code-bench ; archival release: https://doi.org/10.5281/zenodo.20132286
☆ Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA's superiority stems from rectifying the collapsed attention of visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find textual EOS token exhibit much better attention to visual samples, and CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Codes will be released.
☆ Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
comment: 37 pages, 7 figures, 6 tables
☆ Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, {measured by discrepancy between teacher--student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
comment: Pre-print
☆ Unlocking UML Class Diagram Understanding in Vision Language Models
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
☆ Single-Shot HDR Recovery via a Video Diffusion Prior
Recent generative methods for single-shot high dynamic range (HDR) image reconstruction show promising results, but often struggle with preserving fidelity to the input image. They require separate models to handle highlights and shadows, or sacrifice interpretability by directly predicting the final HDR image. We address these limitations by re-casting single-shot HDR reconstruction as conditional video generation and fusing the generated frames into an HDR image. We finetune a video diffusion model to generate an exposure bracket, conditioned on a low dynamic range (LDR) input. We fuse this image bracket using per-pixel weights predicted by a light-weight UNet. This formulation is simple, interpretable, and effective. Rather than directly hallucinating an HDR image, it explicitly reconstructs the intermediate exposure stack and fuses it into the final output. Our method eliminates the need for separate models across exposure regimes and produces HDR reconstructions with high input fidelity. On quantitative benchmarks, we outperform state-of-the-art generative baselines with comparable model capacity on several reconstruction metrics. Human evaluators further prefer our results in 72% of pairwise comparisons against existing methods. Finally, we show that this input-conditioned sequence generation and fusion framework extends beyond HDR to other image reconstruction tasks, such as all-in-focus image recovery from a single defocus-blurred input.
☆ RNA-FM: Flow-Matching Generative Model for Genome-wide RNA-Seq Prediction ICML2026
Histopathology whole-slide images (WSIs) are routinely acquired in clinical practice and contain rich tissue morphology but lack direct molecular architecture and functional programs defining pathological states, whereas RNA sequencing (RNA-seq) provides genome-wide transcriptional profiles at substantial cost, thereby motivating WSI-based genome-wide transcriptomic prediction. Existing approaches for predicting gene expression from WSIs predominantly rely on deterministic regression with one-to-one mapping, limiting their ability to capture biological heterogeneity and predictive uncertainty. We propose RNA-FM, a flow-matching generative framework for genome-wide bulk RNA-seq prediction from WSIs. RNA-FM formulates transcriptomic prediction as a continuous-time conditional transport problem, learning a velocity field that maps a simple prior to the target gene expression distribution conditioned on morphologies. By integrating pathway-level structure, RNA-FM enables scalable and biologically interpretable genome-wide gene expression imputation. Extensive experiments demonstrate that RNA-FM consistently outperforms state-of-the-art approaches while maintaining biological meaningfulness. Code is available at https://github.com/YXSong000/RNA-FM.
comment: 15 pages, 13 tables, 3 figures. Accepted by the Forty-Third International Conference on Machine Learning (ICML2026). Code is available at https://github.com/YXSong000/RNA-FM
☆ Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.
☆ Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
☆ HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.
☆ PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
High-fidelity reconstruction of driving scenes is crucial for autonomous driving. While recent feedforward 3D Gaussian Splatting (3DGS) methods enable fast reconstruction, their per-pixel Gaussian prediction paradigm often suffers from multi-view inconsistency and layering artifacts. Moreover, existing methods often model dynamic instances via dense flow prediction, which lacks explicit cross-view correspondence and instance-level consistency. In this paper, we propose PointForward, a feedforward driving reconstruction framework through point-aligned representations. Unlike pixel-aligned methods, we initialize sparse 3D queries in world space and aggregate multi-view image information via spatial-temporal fusion onto these queries, enforcing explicit cross-view consistency in a single feedforward pass. To handle scene dynamics, we introduce scene graphs that explicitly organize moving instances during reconstruction. By leveraging 3D bounding boxes, our method enables instance-level motion propagation and temporally consistent dynamic representations. Extensive experiments demonstrate that PointForward achieves state-of-the-art performance on large-scale driving benchmarks. The code will be available upon the publication of the paper.
☆ Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration
Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40\% compared to baselines. Code is available at https://github.com/brightXian/LAD.
☆ A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods
This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.
☆ NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
☆ The Midas Touch for Metric Depth
Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present \emph{\textbf{M}idas \textbf{T}ouch for \textbf{D}epth} (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.
☆ TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence.We propose to use text as a semantic anchor for audio-visual representation learning.To this end, we introduce a parameter-efficient adaptation frameworkbuilt on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.
comment: 12 pages, 6 figures
☆ Dynamic Execution Commitment of Vision-Language-Action Models
Vision-Language-Action (VLA) models predominantly adopt action chunking, i.e., predicting and committing to a short horizon of consecutive low-level actions in a single forward pass, to amortize the inference cost of large-scale backbones and reduce per-step latency. However, committing these multi-step predictions to real-world execution requires balancing success rate against inference efficiency, a decision typically governed by fixed execution horizons tuned per task. Such heuristics ignore the state-dependent nature of predictive reliability, leading to brittle performance in dynamic or out-of-distribution settings. In this paper, we introduce A3, an Adaptive Action Acceptance mechanism that reframes dynamic execution commitment as a self-speculative prefix verification problem. A3 first computes a trajectory-wise consensus score of actions via group sampling, then selects a representative draft and prioritizes downstream verification. Specifically, it enforces: (1) consensus-ordered conditional invariance, which validates low-consensus actions by judging whether they remain consistent when re-decoded conditioned on high-consensus actions; and (2) prefix-closed sequential consistency, which guarantees physical rollout integrity by accepting only the longest continuous sequence of verified actions starting from the beginning. Consequently, the execution horizon emerges as the longest verifiable prefix satisfying both internal model logic and sequential execution constraints. Experiments across diverse VLA models and benchmarks demonstrate that A3 eliminates the need for manual horizon tuning while achieving a superior trade-off between execution robustness and inference throughput.
comment: Code will be released in the next version
☆ TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
☆ When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
☆ ScribbleDose: Scribble-Guided Dose Prediction in Radiotherapy MICCAI 2026
Anatomical structure masks are widely adopted in radiotherapy dose prediction, as they provide explicit geometric constraints that facilitate structure-dose coupling. However, conventional manual delineation of these masks requires precise annotation of structure boundaries relevant to radiotherapy, which is time-consuming and labor-intensive. To address these limitations, we propose a scribble-guided dose prediction framework that relies solely on anatomical structures annotated with sparse scribbles. Specifically, we design a Scribble Completion Module (SCM) to generate dense anatomical masks by propagating sparse scribble labels to semantically similar voxels. During the propagation process, a supervoxel-based regularization is introduced to preserve geometric boundary consistency to ensure anatomical plausibility. Furthermore, we propose a Structure-Guided Dose Generation Module (SGDGM) to strengthen the correspondence between sparse structural cues and dose distribution. The completed dense masks derived from scribbles serve as structural guidance to condition dose prediction, forming a scribble-mask-dose learning pipeline under sparse annotation. Experiments on the GDP-HMM dataset demonstrate that ScribbleDose achieves competitive dose prediction performance using only sparse structural annotations. The source code and reannotated scribble annotations are publicly available at https://github.com/iCherishxixixi/ScribbleDose.
comment: This is a preprint version of the paper. The final version will appear in the proceedings of MICCAI 2026
☆ VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck
Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), which explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100\% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7\% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3\% average AUROC and 92\% true positive rate at 5\% false positive rate, which is a 32 percentage point improvement over baseline MSP (85.0\% AUROC, 60.1\% TPR). Compression via the information bottleneck principle ($β=10^{-3}$) reduces Expected Calibration Error by 38\%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
comment: 6 pages, 3 figures, Fall 2025 version
☆ The DAWN of World-Action Interactive Models
A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with \textbf{DAWN} (\textbf{D}enoising \textbf{A}ctions and \textbf{W}orld i\textbf{N}teractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a \emph{World Predictor} with a \emph{World-Conditioned Action Denoiser}: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.
☆ GeoR-Bench: Evaluating Geoscience Visual Reasoning
Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation, geographic question-answering, existing benchmarks remain largely task-specific which failing to capture the open-ended real world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present \textbf{GeoR-Bench}, a \underline{Bench}mark for evaluating \underline{Geo}science visual \underline{R}easoning through reasoning informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions, including reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7\% overall strict accuracy, while the best open-source models only get 10.3\%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.
☆ Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present \textbf{Checkup2Action}, a multimodal clinical check-up report dataset and benchmark for structured \textit{Action Card} generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
☆ XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions
Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
☆ PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting CVPR
Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
comment: Accepted by Computer Vision and Pattern Recognition (CVPR) 2026
☆ LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing
Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3--5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial--color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
comment: 10 pages, 5 figures
☆ Principled Design of Diffusion-based Optimizers for Inverse Problems
Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.
comment: 22 pages, 8 figures, 6 tables
☆ PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.
☆ STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
comment: 11 Pages 3 figures 4 tables
☆ A Mimetic Detector for Adversarial Image Perturbations
Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini--Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino--Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard \texttt{peppers} test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.19\times$ at $k=6$.
☆ 3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
3D Gaussian Splatting (3DGS) enables high-quality real-time 3D rendering but faces challenges in efficiently scaling to ultra-dense scenes and high-resolution due to computational bottlenecks that limit its use in latency-sensitive applications. Instead of optimizing the splatting pipeline itself, we propose \textbf{3DGS$^3$}, a unified post-rendering framework that jointly performs super sampling and frame interpolation through differentiable processing of low-resolution outputs to achieve both high-resolution and high-frame-rate rendering. Our \textbf{Gradient\- \-Aware Super Sampling (GASS)} module leverages the continuous differentiability of 3DGS to extract image gradients that guide a GRU-based refinement network to enable high-fidelity super sampling. Furthermore, a \textbf{Lightweight Temporal Frame Interpolation (LTFI)} module based on a compact U-Net-like backbone fuses temporal and differentiable spatial cues from consecutive frames to synthesize temporally coherent intermediate frames. Experiments on public datasets demonstrate that 3DGS$^3$ achieves superior rendering efficiency and visual quality when compared with state-of-the-art methods and remains compatible with existing 3DGS acceleration techniques. The code will be publicly released upon acceptance.
☆ LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs
Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
comment: 21 pages, 4 figures
☆ Deep Probabilistic Unfolding for Quantized Compressive Sensing
We propose a deep probabilistic unfolding model to address the classical quantized compressive sensing problem that leverages an unfolding framework to enhance the reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply L2 projection to measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics, turning the hard quantization constraint into a soft probabilistic guidance. Furthermore, an efficient, dual-domain Mamba module is specifically designed to dynamically capture and fuse the multi-scale local and global features, ensuring the interactions between the distant but correlated regions. Extensive experiments demonstrate the state-of-the-art performance of the proposed method over previous works, which is capable of promoting the application of quantized compressive sensing in real life.
☆ Encore: Conditioning Trajectory Forecasting via Biased Ego Rehearsals
Learning and representing the subjectivities of agents has become a challenging but crucial problem in the trajectory prediction task. Such subjectivities not only present specific spatial or temporal structures, but also are anisotropic for all interaction participants. Despite great efforts, it remains difficult to explicitly learn and forecast these subjectivities, let alone further modulate models' predictions through a specific ego's subjectivity. Inspired by prefactual thoughts in psychology and relevant theatrical concepts, we interpret such subjectivities in future trajectories as the continuous process from rehearsal to encore. In the rehearsal phase, the proposed ego predictor focuses on how each ego agent learns to derive and direct a set of explicitly biased rehearsal trajectories for all participants in the scene from the short-term observations. Then, these rehearsal trajectories serve as immediate controls to condition final predictions, providing direct yet distinct ego biases for the prediction network to simulate agents' various subjectivities. Experiments across datasets not only demonstrate a consistent improvement in the performance of the proposed \emph{Encore} trajectory prediction model but also provide clear interpretability regarding subjectivities as biased ego rehearsals.
☆ SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
Recent advancements in Large Vision-Language Models (VLMs) have demonstrated exceptional semantic understanding, yet these models consistently struggle with spatial reasoning, often failing at fundamental geometric tasks such as depth ordering and precise coordinate grounding. Recent efforts introduce spatial supervision from scene-centric datasets (e.g., multi-view scans or indoor video), but are constrained by the limited number of underlying scenes. As a result, the scale and diversity of such data remain significantly smaller than those of web-scale 2D image collections. To address this limitation, we propose SpatialForge, a scalable data synthesis pipeline that transforms in-the-wild 2D images into spatial reasoning supervision. Our approach decomposes spatial reasoning into perception and relation, and constructs structured supervision signals covering depth, layout, and viewpoint-dependent reasoning, with automatic verification to ensure data quality. Based on this pipeline, we build SpatialForge-10M, a large-scale dataset containing 10 million spatial QA pairs. Extensive experiments across multiple spatial reasoning benchmarks demonstrate that training on SpatialForge-10M significantly improves the spatial reasoning ability of standard VLMs, highlighting the effectiveness of scaling 2D data for 3D-aware spatial reasoning.
☆ Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
☆ Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.
☆ Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment IEEE
Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.
comment: Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
☆ Beyond Masks: The Case for Medical Image Parsing
Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.
☆ ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models CVPR 2026
In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. Specifically, we design an adaptive gamma correction module that performs spatially varying exposure correction to generate illumination-corrected only representations to mitigate exposure bias and serve as reliable inputs for subsequent diffusion processes, where a histogram-guided illumination correction loss is introduced to regularize the corrected illumination distribution toward that of natural scenes. Subsequently, the illumination-corrected image is treated as an intermediate noisy state for the proposed perturbed consistency diffusion model to reconstruct details and suppress noise. Moreover, a perturbed diffusion consistency loss is proposed to constrain the forward diffusion trajectory of the final restored image to remain consistent with the perturbed state, thus improving restoration fidelity and stability in the absence of supervision. Extensive experiments on publicly available benchmarks show that the proposed method outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Code is available at https://github.com/JianghaiSCU/ZeroIDIR.
comment: Accepted by CVPR 2026
☆ Diabetic Retinopathy Classification using Downscaling Algorithms and Deep Learning
Diabetic Retinopathy (DR) is an art and science of recording and classifying the retinal images of a diabetic patient. DR classification deals with classifying retinal fundus image into five stages on the basis of severity of diabetes. One of the major issue faced while dealing with DR classification problem is the large and varying size of images. In this paper we propose and explore the use of several downscaling algorithms before feeding the image data to a Deep Learning Network for classification. For improving training and testing; we amalgamate two datasets: Kaggle and Indian Diabetic Retinopathy Image Dataset. Our experiments have been performed on a novel Multi Channel Inception V3 architecture with a unique self crafted preprocessing phase. We report results of proposed approach using accuracy, specificity and sensitivity, which outperform the previous state of the art methods. Index Terms: Diabetic Retinopathy, Downscaling Algorithms, Multichannel CNN Architecture, Deep Learning
☆ PD-4DGS:Progressive Decomposition of 4D Gaussian Splatting for Bandwidth-Adaptive Dynamic Scene Streaming
4D Gaussian Splatting (4DGS) enables high-quality dynamic novel view synthesis, yet current models remain monolithic bitstreams that clients must download in full before any frame can be rendered, causing black-screen waits of tens to hundreds of seconds on mobile bandwidth and leaving 4DGS incompatible with modern adaptive-bitrate delivery. Progressive 3DGS compression alleviates this for static scenes, but it acts only on spatial anchors and cannot partition the temporal deformation networks that dominate dynamic-scene size. We present PD-4DGS, the first framework for progressive compression and on-demand transmission of 4DGS. Hierarchical Deformation Decomposition (HDD) externalises the coarse-to-fine motion hierarchy already latent in 4DGS into three independently transmittable layers -- a static scaffold, a global deformation, and a local refinement -- so that any prefix of the bitstream is already renderable, turning a single training run into a scalable, DASH/HLS-compatible bitstream. A Gaussian-entropy attribute rate-distortion loss together with a temporal mask consistency regulariser shrink the base layer while suppressing low-bitrate flicker; a capacity-weighted rollout schedule, gated online by a learnt activation rate rho, then prevents deformation-network under-training without any per-scene hyperparameter. On the Dycheck iPhone benchmark, PD-4DGS cuts the streamed bitstream by >60% at matched rendering fidelity and reduces first-frame latency from 73--930 s to ~1.7 s on a 2 Mbps link, uniquely enabling true on-demand progressive streaming for 4DGS.
☆ VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors SIGGRAPH
Gaussian Splatting has achieved remarkable progress in multi-view surface reconstruction, yet it exhibits notable degradation when only few views are available. Although recent efforts alleviate this issue by enhancing multi-view consistency to produce plausible surfaces, they struggle to infer unseen, occluded, or weakly constrained regions beyond the input coverage. To address this limitation, we present VidSplat, a training-free generative reconstruction framework that leverages powerful video diffusion priors to iteratively synthesize novel views that compensate for missing input coverage, and thereby recover complete 3D scenes from sparse inputs. Specifically, we tackle two key challenges that enable the effective integration of generation and reconstruction. First, for 3D consistent generation, we elaborate a training-free, stage-wise denoising strategy that adaptively guides the denoising direction toward the underlying geometry using the rendered RGB and mask images. Second, to enhance the reconstruction, we develop an iterative mechanism that samples camera trajectories, explores unobserved regions, synthesizes novel views, and supplements training through confidence weighted refinement. VidSplat performs robustly to sparse input and even a single image. Extensive experiments on widely used benchmarks demonstrate our superior performance in sparse-view scene reconstruction.
comment: Accepted by SIGGRAPH Conference 2026. Project Page: https://tangjm24.github.io/VidSplat
☆ JACoP: Joint Alignment for Compliant Multi-Agent Prediction CVPR
Stochastic Human Trajectory Prediction (HTP) using generative modeling has emerged as a significant area of research. Although state-of-the-art models excel in optimizing the accuracy of individual agents, they often struggle to generate predictions that are collectively compliant, leading to output trajectories marred by social collisions and environmental violations, thus rendering them impractical for real-world applications. To bridge this gap, we present JACoP: Joint Alignment for Compliant Multi-Agent Prediction, an innovative multi-stage framework that ensures scene-level plausibility. JACoP incorporates an Anchor-Based Agent-Centric Profiler for effective initial compliance filtering and employs a Markov Random Field (MRF) based aligner to formalize the joint selection for scene predictions. By representing inter-agent spatial and social costs as MRF energy potentials, we successfully infer and sample from the joint trajectory distribution, achieving prediction with optimal scene compliance. Comprehensive experiments show that JACoP not only achieves competitive accuracy, but also sets a new standard in reducing both environmental violations and social collisions, thereby confirming its ability to produce collectively feasible and practically applicable trajectory predictions.
comment: Accepted by CVPRF 2026
☆ HamBR: Active Decision Boundary Restoration Based on Hamiltonian Dynamics for Learning with Noisy Labels
In large-scale visual recognition and data mining tasks, the presence of noisy labels severely undermines the generalization capability of deep neural networks (DNNs). Prevalent sample selection methods rely primarily on training loss or prediction confidence for passive screening. However, within a feature space degraded by noise, decision boundaries undergo systematic boundary collapse. This phenomenon hinders the ability of the model to distinguish between hard clean samples and noisy samples at the decision margins, thereby creating a significant performance bottleneck. This study is the first to emphasize the pivotal importance of active boundary restoration for noise-robust learning. We propose HamBR, a novel paradigm based on Hamiltonian dynamics. The core approach leverages the Spherical Hamiltonian Monte Carlo (Spherical HMC) mechanism to actively probe inter-class ambiguous regions within the representation space and synthesize high-quality virtual outliers. By imposing explicit repulsion constraints via energy-based modeling, these synthesized samples establish robust energy barriers at the decision boundaries. This mechanism forces real samples to move from dispersed overlapping regions toward their respective class centers, thereby restoring the discriminative sharpness of the decision boundaries. HamBR demonstrates exceptional versatility and can be integrated as a plug-and-play defense module into existing semi-supervised noisy label learning frameworks. Empirical evaluations show that the proposed paradigm significantly enhances the discriminative accuracy of hard boundary samples, achieving state-of-the-art (SOTA) performance on CIFAR-10/100 and real-world noise benchmarks. Furthermore, it exhibits superior convergence efficiency and reliable robustness, while improving significantly the capability of the model for Out-of-Distribution (OOD) detection.
☆ Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers CVPR
Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly due to existing HOI datasets limited to static interactions, and pretrained agents capable of either dynamic full-body motions without objects or static HOI motions. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, are yet limited to either static or short-term contacts e.g. striking. In this work, we propose a framework that fulfills dynamic and long-term interaction motions such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Our method over relevant prior-arts consistently improves success rates while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.
comment: CVPR Findings 2026
☆ 3D-Belief: Embodied Belief Inference via Generative 3D World Modeling
Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.
☆ PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.
☆ Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
☆ Gradient-Free Noise Optimization for Reward Alignment in Generative Models
Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
♻ ☆ Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score--ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.
♻ ☆ Does Head Pose Correction Improve Biometric Facial Recognition?
Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.
♻ ☆ Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks CVPR 2026
Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. We conduct a systematic evaluation of Nano Banana 2 across diverse scenes and degradations. Our results show that prompt design is critical, with concise prompts and explicit fidelity constraints achieving a better balance between reconstruction and perceptual quality. Nano Banana 2 achieves competitive full-reference performance and is consistently preferred in user studies, while showing strong generalization in challenging scenarios. However, we observe a gap between perceptual quality and restoration fidelity, as the model tends to produce visually rich results with over-enhanced details and inconsistencies. This issue is not well captured by existing IQA metrics or user studies. Overall, general-purpose models show promise as unified IR solvers from a perceptual perspective, but require improved controllability and fidelity-aware evaluation. Further comparisons and detailed analyses are available in our project repository: https://github.com/yxyuanxiao/NanoBanana2TestOnIR.
comment: Accepted by CVPR 2026 Workshop AAVM
♻ ☆ From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers ICML 2026
Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift or WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channel, or (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and then how to fix it. Code and raw data are provided in the supplementary materials.
comment: 17 pages, 17 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026). Poster: https://icml.cc/virtual/2026/poster/66573. This preprint is the submitted version. The camera-ready version will be uploaded shortly
♻ ☆ CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-rounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-pecific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary vidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
comment: This manuscript has been withdrawn by the authors because we found a methodological flaw in the formulation and evaluation of the proposed approach. The issue affects the reliability of the experimental results and the conclusions drawn from them. Therefore, the authors consider the current version unsuitable for citation or further use
♻ ☆ Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2\% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
♻ ☆ Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field,and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
comment: 51 pages, 7 figures, github: https://github.com/THUSI-Lab/Awesome-LFMs-Play-Games
♻ ☆ Enabling clinical use of foundation models for computational pathology
Foundation models for computational pathology are expected to facilitate the development of high-performing, generalisable deep learning systems. However, in addition to biologically relevant features, current foundation models also capture pre-analytic and scanner-specific variation that bias the predictions made by downstream task-specific models trained on these features. Here we show that introducing novel robustness losses during downstream model training reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 whole-slide images from 6,155 patients is used to train thousands of models from the features of eight well-known foundation models for computational pathology. In addition to a substantial improvement in robustness, our approach improves classification accuracy by focusing on biologically relevant features. It mitigates robustness limitations of foundation models for computational pathology without retraining the foundation models themselves, enabling development of models that are more suitable in real-world clinical use.
♻ ☆ Efficient Bayesian Inference from Noisy Pairwise Comparisons
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ provides efficient inference, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
♻ ☆ Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization
Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Regretfully, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end Simulation-ready Physics-Aware Reconstruction for Cluttered Scenes (SPARCS) pipeline, which integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.Project webpage: https://rory-weicheng.github.io/SPARCS/.
comment: Accepted to RSS 2026, camera-ready version; 17 pages, 15 figures
♻ ☆ GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.
♻ ☆ SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks SP
Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose Shadow-Aware and Removal Unified (SARU) Framework , a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training-free N$^2$SGSR algorithm attains an average processing speed of approximately $1.3$s, which is over $10$ times faster than the SOTA MAOSD while maintains an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS-GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU
comment: Accepted by ISPRS
♻ ☆ MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .
comment: 20 pages
♻ ☆ VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose \textbf{VLADriver-RAG}, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a \textit{Visual-to-Scenario} mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a \textit{Scenario-Aligned Embedding Model} that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.
♻ ☆ LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semanticsconditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbones hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition onrather than adding parameters or relying on text promptsis the governing lever for robust and caption-free image restoration in the wild.
comment: Project Page: https://w2genai-lab.github.io/LucidFlux
♻ ☆ LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution
Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence, exhibiting semantic or structural hallucinations. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidate restorations. However, effective alignment in Real-ISR is hindered by three coupled challenges: (i) the lack of an LR-referenced faithfulness signal that is robust to degradation yet sensitive to localized hallucinations, (ii) a rollout-group optimization bottleneck where scalarizing heterogeneous rewards before normalization compresses objective-wise contrasts and weakens DiffusionNFT-style reward-weighted updates, and (iii) limited coverage of real degradations, which restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-invariant and hallucination-sensitive LR-referenced evaluator trained with content-consistent degradation pools and original-inpainted hard negatives; a decoupled reward normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion; and LucidLR, a large-scale collection of real-world degraded images for robust RL fine-tuning. Extensive experiments show that LucidNFT improves perceptual quality on strong flow-based Real-ISR baselines while generally maintaining LR-referenced consistency across diverse real-world scenarios.
♻ ☆ PointCaM: Cut-and-Mix for Open-Set Point Cloud Learning
Point cloud learning is receiving increasing attention. However, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper primarily discusses point cloud learning in open-set settings, where we train the model without data from unknown classes and identify them during the inference stage. In essence, we propose a novel Point Cut-and-Mix mechanism for solving open-set point cloud learning, comprising an Unknown-Point Simulator and an Unknown-Point Estimator module. Specifically, we use the Unknown-Point Simulator to simulate out-of-distribution data in the training stage by manipulating the geometric context of partially known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context to discriminate between known and unknown data. Unlike existing methods that only consider classifier features, our proposed solution leverages multi-level feature contexts to recognize unknown point cloud objects more effectively. We test the proposed approach on several datasets, including customized S3DIS, ModelNet40, and ScanObjectNN. The improved open-set performances over comparative baselines show the effectiveness of our PointCaM method. Our code is available at https://github.com/JHome1/pointcam.
comment: Accepted in CVIU
♻ ☆ Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents ACL 2026
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.
comment: Accepted by ACL 2026 Main Conference
♻ ☆ FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.
♻ ☆ FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching CVPR 2026
Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_β^ω$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS
comment: Accepted to CVPR 2026
♻ ☆ MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery ICML 2026
This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity
comment: Accepted by ICML 2026
♻ ☆ MieDB-100k: A Comprehensive Dataset for Medical Image Editing
The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that model trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
♻ ☆ A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata
Stephen Wolfram proclaimed in his 2003 seminal work "A New Kind Of Science" that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram's ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch. Supplementary materials, videos, and code are available at the project website: https://www.neural-cellular-automata.org/
♻ ☆ DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions
While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($canonical \rightarrow dynamic$) and the backward deformation required for volumetric SDF rendering ($dynamic \rightarrow canonical$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.
♻ ☆ Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
comment: 11 pages, 6 figures
♻ ☆ Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification
Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.
♻ ☆ UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration
Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) \textbf{Degradation-specific dependency}, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) \textbf{Over-smoothing bias}, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.
comment: 6 figures, 11 tables
♻ ☆ GRASP: Guided Residual Adapters with Sample-wise Partitioning
Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization suitable as gradient alignment proxy. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme longtail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80\% and lifts tail-class coverage by up to 44\% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on $9$ of $13$ classes versus $3$ of $13$ from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.
comment: 16 pages, 6 figures, 6 tables
♻ ☆ DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes IEEE
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Modality-Inconsistent Continual Learning of Multimodal Large Language Models
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
comment: Accepted at Transactions on Machine Learning Research (TMLR), 2026
♻ ☆ Vision-aligned Latent Reasoning for Multi-modal Large Language Model ICML 2026
Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.
comment: Published as conference proceeding for ICML 2026. Last two authors advised equally
♻ ☆ UGround: Towards Unified Visual Grounding with Unrolled Transformers ICML 2026
We present UGround, a \textbf{U}nified visual \textbf{Ground}ing paradigm that dynamically selects intermediate layers across \textbf{U}nrolled transformers as ``mask as prompt,'' diverging from the prevailing pipeline that leverages the fixed last hidden layer as ``\texttt{} as prompt.'' UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of \texttt{} as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each \texttt{} token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the \texttt{} token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional refer expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
comment: This work has been accepted to ICML 2026, please refer to https://github.com/rui-qian/UGround
♻ ☆ LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincare alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
♻ ☆ SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the "Speech Analysis for Neurodegenerative Diseases" (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.
♻ ☆ AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model
Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty in establishing an optimal training paradigm due to inherent conflicting targets in understanding and generation tasks. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., Double image encoders, MOE/MOT architecture, or frozen MLLM). However, excessive model decoupling can lead to the loss of interleave generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. Firstly, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during SFT and post-training stage respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.
comment: Project page: https://zhengdian1.github.io/AIA-project/ Code: https://github.com/zhengdian1/AIA
♻ ☆ PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution IEEE
The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen advanced generators like diffusion in a fine-grained manner. In this paper, we propose a novel parsing-aware vision language model with a dynamic contrastive learning (PVLM) method for zero-shot deepfake attribution (ZSDFA), which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors to unseen advanced generators like diffusion. Besides, we propose an innovative PVLM attributor based on the vision-language model to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We propose to employ the inherent facial attributes preservation differences to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.
comment: Accepted to IEEE Transactions on Dependable and Secure Computing
♻ ☆ Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs ACL 2026
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.
comment: To be published in Findings of ACL 2026
♻ ☆ SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear on whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
comment: Preprint
♻ ☆ SOAR: Regression-based LiDAR Relocalization for UAVs
Regression-based LiDAR relocalization has recently emerged as a promising solution for high-precision positioning in GNSS-denied environments. However, these methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in unmanned aerial vehicle (UAV) scenarios due to arbitrary pose variations and irregular flight paths. In this paper, we propose SOAR, a regression-based LiDAR relocalization framework for UAVs. Specifically, we introduce a locality-preserving sliding window attention module with locally invariant positional encoding to capture discriminative geometric structures robust to viewpoint changes. A coordinate-independent feature initialization module is further designed to eliminate sensitivity to global transformations. Furthermore, most existing UAV datasets are limited to evaluate LiDAR relocalization in real-world, due to the lack of synchronized LiDAR scans, accurate 6-DoF poses, or multiple traversals. Thus, we construct a large-scale UAV LiDAR localization dataset with 4 scenes and 13 irregular paths exhibiting rotation and altitude variations, providing a more realistic benchmark for UAVs. Extensive experiments demonstrate that our method achieves state-of-the-art performance, improving the localization success rate by 40% and reducing mean error over 10m on UAVLoc. Our code and dataset will be released soon.
comment: 24 pages, 14 figures
♻ ☆ B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding ACM MM 2025
Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://github.com/ccho4702/B4DL
comment: Accepted at ACM MM 2025
♻ ☆ CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization
Due to the potential for exploratory reasoning of Latent Visual Reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory of latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain a more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. Firstly, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embedding. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of latent visual reasoning process and thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out of domain benchmarks, with a 3.40% gain on MMStar. The data, codes, and models are released at https://github.com/Oscar-dzy/CoLVR.
♻ ☆ Hyperbolic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.
comment: 24 pages, 14 figures
♻ ☆ Physics-Informed Graph Neural Networks for Frequency-Aware Optical Aberration Correction
Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees-ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrates that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. We further validate on experimental PSF data from a physical microscope and demonstrate robustness to realistic sensor noise, confirming generalisation beyond simulated conditions. Code is available at https://github.com/janetkok/ZRNet.
♻ ☆ A Lightweight Transformer for Pain Recognition from Brain Activity
Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.
♻ ☆ VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network CVPR 2026
The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available at \href{https://github.com/Eddieyzp/VIMCAN}{this GitHub repository}.
comment: Accepted in CVPR 2026
♻ ☆ Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning ICML 2026
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions since these ``null bases" of old tasks can remain nearly inactive for new task under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods. Our code is available at https://github.com/HHHLF/LoDA_ICML2026.
comment: Accepted by ICML 2026
♻ ☆ Calibrated Multimodal Representation Learning with Missing Modalities ICML 2026
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.
comment: Accepted by ICML 2026
♻ ☆ GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency-particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig.2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.
♻ ☆ Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2{=}0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
♻ ☆ DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows IEEE
We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io
comment: Published in IEEE TPAMI (vol. 48, no. 5, May 2026). This is an extended version of the ICCV 2023 paper (DiFaReli)
♻ ☆ An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories
Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors-such as camera orientation, subject scale, viewpoint, and execution speed-rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape - Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
comment: 9 pages
♻ ☆ FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26$\times$ and 122$\times$ speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.
comment: Code: https://github.com/GuoCalix/FlashClear
♻ ☆ Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: \textit{Discriminative RMs} regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, \textit{Generative RMs} with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled ``think-then-score'' paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
♻ ☆ P-Flow: Proxy-gradient Flows for Linear Inverse Problems
Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.
♻ ☆ BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
comment: 9 page 7 figures
♻ ☆ Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
♻ ☆ Spectral-Adaptive Modulation Networks for Visual Perception IEEE
Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a \textit{spectral-adaptive modulation} (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
♻ ☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
♻ ☆ KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes
Facial animation is a core component for creating digital characters in Computer Graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.
♻ ☆ Monitoring access to piped water and sanitation infrastructure in Africa at disaggregated scales using satellite imagery and self-supervised learning
Access to drinking water and sanitation services is essential for health and well-being, yet large global disparities persist. Sustainable Development Goal (SDG) 6 sets targets for universal access to these services, but progress toward these targets is hindered by existing monitoring systems that rely heavily on costly, infrequent, spatially uneven household surveys and censuses subject to substantial reporting delays. To address this gap, this study develops a scalable remote-sensing framework for estimating access to piped water and sewage systems at approximately 2.56 km spatial resolution. The framework combines Sentinel-2 imagery, Afrobarometer survey responses, 30 m population data, and Vision Transformer representations learned with DINO self-supervised learning. The best-performing model achieves held-out AUROC values of 91.54\% for piped water and 93.24\% for sewage system access across African survey locations. Applied to gridded inference across 50 African countries, the resulting population-weighted estimates closely track WHO/UNICEF JMP statistics for piped water access ($R^2=0.92$) and show meaningful agreement for sewage-related sanitation access ($R^2=0.72$). In countries without Afrobarometer survey coverage, the model achieves population-weighted MAEs of 9.5\% for piped water and 10.7\% for sewage system, with estimates falling within 15\% of JMP values for 121.4 million and 159.7 million people, respectively. A Nigeria application across 767 Local Government Areas (LGAs) shows how our framework's fine-scale predictions reveal subnational spatial inequality relevant to environmental justice.
♻ ☆ L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation
Existing 2D-3D lifting human pose estimation methods have achieved strong performance. But the utilization of historical pose representations across network depth was overlooked. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial. We further identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective network cross-layer history feature utilization. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.
comment: 15page
♻ ☆ Attention Sinks in Diffusion Transformers: A Causal Analysis
Attention sinks -- tokens that receive disproportionate attention mass -- are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion~3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at $k{=}1$; only under stronger interventions ($k\!\geq\!10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless \emph{sink-specific} -- $\sim\!6\times$ larger than equal-budget random masking -- revealing an empirical dissociation between trajectory-level perturbation and \emph{semantic alignment} in diffusion transformers. \footnote{Code available at https://github.com/wfz666/ICML26-attention-sink.}
♻ ☆ READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling AAAI 2024
Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. The code, model, and data have been made available at https://nguyentthong.github.io/READ.
comment: Accepted at AAAI 2024
♻ ☆ Saving Foundation Flow-Matching Priors for Inverse Problems ICML 2026
Foundation flow-matching (FM) models promise universal priors for solving inverse problems (IPs); yet today, they trail behind domain-specific and even untrained priors. \emph{How can we unlock their potential?} We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. For evaluation, we consider both simple image restoration tasks and scientific IPs with a few similar samples -- where the prohibitive cost of data collection and model training hinders the development of domain-specific generative models. Our superior experimental results confirm the effectiveness of FMPlug. Overall, FMPlug paves the way for making foundation FM models practical, reusable priors for IPs, especially scientific ones with few similar samples. More details are available at https://sun-umn.github.io/xm-plug/ .
comment: Accepted by ICML 2026
♻ ☆ DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding EMNLP 2023
Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
comment: Accepted at EMNLP 2023 (Findings). Code is available at https://github.com/nguyentthong/demaformer
♻ ☆ FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strictly next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before jointly fine-tuned with backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through a lightweight post-training with merely 0.05% of the original training data.
comment: Post-training acceleration for autoregressive image generation, code is available at https://lxazjk.github.io/FlashAR/
♻ ☆ TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.
♻ ☆ Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
comment: Work in progress. Code is available at https://github.com/nguyentthong/eulerian_motion_guidance
♻ ☆ VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. The VidNum-1.4K is uniquely structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while the Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%--45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
comment: 7 pages, 5 figures, under review at ACMMM 2026 Dataset Track
♻ ☆ Learnable Multi-level Discrete Wavelet Transforms for 3D Gaussian Splatting Frequency Modulation
3D Gaussian Splatting (3DGS) has emerged as a powerful approach for novel view synthesis. However, the number of Gaussian primitives often grows substantially during training as finer scene details are reconstructed, leading to increased memory and storage costs. Recent coarse-to-fine strategies regulate Gaussian growth by modulating the frequency content of the ground-truth images. In particular, AutoOpti3DGS employs the learnable Discrete Wavelet Transform (DWT) to enable data-adaptive frequency modulation. Nevertheless, its modulation depth is limited by the 1-level DWT, and jointly optimizing wavelet regularization with 3D reconstruction introduces gradient competition that promotes excessive Gaussian densification. In this paper, we propose a multi-level DWT-based frequency modulation framework for 3DGS. By recursively decomposing the low-frequency subband, we construct a deeper curriculum that provides progressively coarser supervision during early training, consistently reducing Gaussian counts. Furthermore, we show that the modulation can be performed using only a single scaling parameter, rather than learning the full 2-tap high-pass filter. Experimental results on standard benchmarks demonstrate that our method further reduces Gaussian counts while maintaining competitive rendering quality.
comment: Accepted to EUSIPCO 2026
♻ ☆ StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling
We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
♻ ☆ Fully AI-Generated Image Detection: Definition, Recent Advances and Challenges
Recent advances in visual generative models have enabled the creation of highly realistic, fully AI-generated images without relying on real source content. While beneficial for many applications, these models also pose significant societal risks, as they can be easily exploited to produce convincing Deepfakes. Detecting them represents a foundational yet challenging problem in AI media forensics, requiring detectors to reliably extract the inherent artifacts imprinted by generative architectures. In this Review, we provide a systematic overview of fully AI-generated image detection. Following the standard detector design pipeline, we focus on two key components: dataset construction and artifact extraction. We analyze how dataset design influences the generalization and robustness of learned artifacts, and categorize existing artifact extraction methods based on the primary inductive priors leveraged to isolate artifacts. Within this framework, we systematically review existing works. Finally, we highlight open problems and envision several future directions for developing more robust and generalizable detectors. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Fully-AI-Generated-Image-Detection.
♻ ☆ ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict \textit{cut cake} from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose \textbf{ScriptHOI}, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
♻ ☆ Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey
Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
♻ ☆ Position: Universal Aesthetic Alignment Narrows Artistic Expression
Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when "anti-aesthetic" outputs are requested for artistic or critical purposes. This adherence prioritizes developer-centered values, compromising user autonomy and aesthetic pluralism. We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. This position paper finds that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks. Our code, fine-tuned models, and datasets are available on our meta-expression intentionally anti-aesthetics webpage: https://weathon.github.io/icml2026_position/.
♻ ☆ Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
comment: 15 pages, accepted to Robotics: Science and Systems (RSS) 2026
♻ ☆ Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction
In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
comment: 15 pages, 17 tables
♻ ☆ Space Syntax-guided Post-training for Residential Floor Plan Generation
Residential floor plan generation requires not only geometric fidelity but also spatial configurational logic: shared living spaces should be integrative, while private spaces should remain segregated. Existing generators increasingly use room-relation graphs as input-side conditions, but generated layouts are rarely evaluated on the output side for configurational quality, and such evaluation is rarely fed back into model optimization. We propose Space Syntax-guided Post-training (SSPT), a framework that turns space-syntax integration from a post-hoc analysis tool into a computable feedback signal for already-trained floor plan generators. SSPT introduces the Space Syntax Integration Oracle (SSIO), which converts generated layouts into rectangle-space graphs and measures public-space dominance and functional hierarchy. SSIO is first applied to real residential data to establish empirical configurational references, then connected to two SSPT strategies: SSPT-Iter, a basic generate-filter-retrain route, and SSPT-PPO, the first RL-based post-training route for floor plan generation. We also introduce SSPT-Bench, a new evaluation system for measuring the output-side spatial configurational quality of post-trained generators under an out-of-distribution setting. Experiments show that both strategies improve public-space dominance and functional-hierarchy alignment over the unpost-trained baseline. SSPT-PPO achieves stronger gains, lower variance, and higher efficiency than iterative retraining. These results show that output-side configurational evaluation can serve as actionable post-training feedback, offering a practical path for injecting architectural theory into existing floor plan generation backbones.
♻ ☆ Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers
Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors are limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarly because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.
comment: Published as a Conference Paper at The Fourteenth International Conference on Learning Representations
♻ ☆ DSA-NRP: No-Reflow Prediction from Angiographic Perfusion Dynamics in Stroke EVT
Following successful large-vessel recanalization via endovascular thrombectomy (EVT) for acute ischemic stroke (AIS), some patients experience a complication known as no-reflow, defined by persistent microvascular hypoperfusion that undermines tissue recovery and worsens clinical outcomes. Although prompt identification is crucial, standard clinical practice relies on perfusion magnetic resonance imaging (MRI) within 24 hours post-procedure, delaying intervention. In this work, we introduce the first-ever machine learning (ML) framework to predict no-reflow immediately after EVT by leveraging previously unexplored intra-procedural digital subtraction angiography (DSA) sequences and clinical variables. Our retrospective analysis included AIS patients treated at UCLA Medical Center (2011-2024) who achieved favorable mTICI scores (2b-3) and underwent pre- and post-procedure MRI. No-reflow was defined as persistent hypoperfusion (Tmax > 6 s) on post-procedural imaging. From DSA sequences (AP and lateral views), we extracted statistical and temporal perfusion features from the target downstream territory to train ML classifiers for predicting no-reflow. Our novel method significantly outperformed a clinical-features baseline(AUC: 0.7703 $\pm$ 0.12 vs. 0.5728 $\pm$ 0.12; accuracy: 0.8125 $\pm$ 0.10 vs. 0.6331 $\pm$ 0.09), demonstrating that real-time DSA perfusion dynamics encode critical insights into microvascular integrity. This approach establishes a foundation for immediate, accurate no-reflow prediction, enabling clinicians to proactively manage high-risk patients without reliance on delayed imaging.
comment: 15 pages, 4 figures
♻ ☆ Pi-HOC: Pairwise 3D Human-Object Contact Estimation
Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. Current state-of-the-art leverages powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly in inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.
Artificial Intelligence 300
☆ AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward ICML2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
comment: ICML2026
☆ Learning, Fast and Slow: Towards LLMs That Adapt Continually
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
☆ Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage~3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from $75.4\%$ to $78.5\%$ after the bridge and outperforms a matched replay control by $2.8$ points. The operational principal is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
☆ ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
☆ OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
comment: Project page: https://zghhui.github.io/OmniNFT/
☆ Reward Hacking in Rubric-Based Reinforcement Learning
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
☆ KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
comment: 12 pages, 3 figures, 6 tables
☆ Solve the Loop: Attractor Models for Language and Reasoning
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
☆ Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs
Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
☆ Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance
To address the issues of high interruption time and measurement report overhead under user equipment (UE) mobility especially in high speed 5G use cases the use of AI/ML techniques (AI/ML beam management and mobility procedures) have been proposed. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. Therefore, there is an utmost need for realistic datasets under various conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to depict real time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution. To support this research, the dataset includes timing advance (TA) measurements at various signaling events such as RACH trigger, MAC CE, and PDCCH grant which are typically missing in existing works. We cover a detailed description of the creation of the dataset; experimental setup, data acquisition, and extraction. We also cover an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of the inference of the AI/ML model. One such use case is to train and evaluate various AI/ML models for TA prediction.
☆ The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.
☆ A Causal Language Modeling Detour Improves Encoder Continued Pretraining
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
☆ CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction
With the vast amount of content uploaded every hour, along with the AI generated content that can include hallucinations, Automated Fact-Checking (AFC) has become increasingly vital, as it is infeasible for human fact-checkers to manually verify the sheer volume of information generated online. Professional fact-checkers have identified several gaps in existing AFC systems, noting a misalignment between how these systems operate and how fact-checking is performed in practice. In this paper, we introduce CAAFC (Chronological Actionable Automated Fact-Checker), a frame-work designed to bridge these gaps. It surpasses SOTA AFC and hallucination detection systems across multiple benchmark datasets. CAAFC operates on claims, conversations, and dialogues, enabling it not only to detect factual errors and hallucinations, but also to correct them by providing actionable justifications supported by primary information sources. Furthermore, CAAFC can update evidence and knowledge bases by incorporating recent and contextual information when necessary, thereby enhancing the reliability of fact verification.
☆ Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers
Large Language Models (LLMs) struggle to solve complex combinatorial problems through direct reasoning, so recent neuro-symbolic systems increasingly use them to synthesize executable solvers. A central design question is how the LLM should represent the solver, and whether it should also attempt to optimize search. We introduce CP-SynC-XL, a benchmark of 100 combinatorial problems (4,577 instances), and evaluate three solver-construction paradigms: native algorithmic search (Python), constraint modeling through a Python solver API (Python + OR-Tools), and declarative constraint modeling (MiniZinc + OR-Tools). We find a consistent representational divergence: Python + OR-Tools attains the highest correctness across LLMs, while MiniZinc + OR-Tools has lower absolute coverage despite using the same OR-Tools back-end. Native Python is the most likely to return a schema-valid solution that fails verification, whereas solver-backed paths preserve higher conditional fidelity. On the heuristic axis, prompting for search optimization yields only small median speed-ups (1.03-1.12x) and a strongly bimodal effect: many instances slow down, and correctness drops sharply on a long tail of problems. A paired code-level audit traces these regressions to a recurring heuristic trap. Under an efficiency-oriented prompt, the LLM may replace complete search with local approximations (Python), inject unverified bounds (Python + OR-Tools), or add redundant declarative machinery that overwhelms or over-constrains the model (MiniZinc + OR-Tools). These findings support a conservative design principle for LLM-generated combinatorial solvers: use the LLM primarily to formalize variables, constraints, and objectives for verified solvers, and separately check any LLM-authored search optimization before use.
☆ Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
☆ Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.
☆ Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems
Recent advances in reinforcement learning from human feedback (RLHF) and preference optimization have substantially improved the usability, coherence, and safety of large language models. However, recurring behaviors such as performative certainty, hallucinated continuity, calibration drift, sycophancy, and suppression of visible uncertainty suggest unresolved structural issues within scalarized preference optimization systems. We propose Semantic Reward Collapse (SRC): the compression of semantically distinct forms of evaluative dissatisfaction into generalized optimization signals. Under SRC, categories such as factual incorrectness, uncertainty disclosure, formatting dissatisfaction, latency, and social preference may become entangled within a shared reward topology despite representing fundamentally different epistemic classes. We argue that adaptive reasoning systems operating under generalized evaluative pressure may drift toward suppression of visible epistemic failure rather than preservation of calibrated uncertainty integrity. These behaviors are framed strictly as optimization consequences rather than evidence of deception or anthropomorphic agency. Drawing on institutional proxy collapse, metric gaming, software reliability engineering, and human learning theory, we propose that uncertainty disclosure and escalation behavior should be treated as protected epistemic conduct rather than globally penalized task incompletion. Finally, we introduce Constitutional Reward Stratification (CRS), a domain-aware reward framework intended to preserve differentiated epistemic attribution within adaptive learning systems. We present CRS not as a validated solution, but as a testable governance-oriented research direction requiring further empirical investigation.
comment: 15 pages including references. Position and framework paper. Companion empirical work available at arXiv:2604.17587
☆ OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
We study {on-policy self-distillation} (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose \methodname, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, \methodname stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
comment: 12 pages, 7 figures, 3 tables
☆ Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}_{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
comment: 24 pages, 24 figures
☆ SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation ICML 2026
Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
comment: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices
☆ Scalable Token-Level Hallucination Detection in Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
☆ Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.
☆ Discrete Flow Matching for Offline-to-Online Reinforcement Learning
Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address those challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete action RL task show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.
☆ ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
Table processing-including cleaning, transformation, augmentation, and matching-is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.
☆ Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts
Accurate crop yield forecasting in commercial soft fruit production is constrained by the data available in typical commercial farm records, which lack the sensor networks, satellite imagery, and high-resolution meteorological inputs that most state-of-the-art approaches assume. We propose a structured LLM agent framework that performs post-hoc correction of existing model predictions, encoding agricultural domain knowledge across tools for phase detection, bias learning, and range validation. Evaluated on a proprietary strawberry yield dataset and a public USDA corn harvest dataset, agent refinement of XGBoost reduced MAE by 20% and MASE by 56% on strawberry, with consistent improvements across Moirai2 (MAE 24%, MASE 22%) and Random Forest (MAE 28%, MASE 66%) baselines. Using Llama 3.1 8B as the agent produced the strongest corrections across all configurations; LLaVA 13B showed inconsistent gains, highlighting sensitivity to the choice of refinement model.
comment: 21 pages, 6 figures, 6 tables
☆ Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose \textbf{GAP}, a \textbf{G}ranular \textbf{A}lignment \textbf{P}aradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
☆ Classifier Context Rot: Monitor Performance Degrades with Context Length
Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In particular, on a dataset that requires identifying when a coding agent takes a subtly dangerous action, Opus 4.6, GPT 5.4, and Gemini 3.1 miss these actions $2\times$ to $30\times$ more often when they occur after 800K tokens of benign activity than when they occur on their own. We also show that these weaknesses can be partially mitigated with prompting techniques such as periodic reminders throughout the transcript and may be mitigated further with better post-training. Monitor evaluations that do not consider long-context degradation are likely overestimating monitor performance.
☆ QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning
Qubit routing is a fundamental problem in quantum compilation, known to be NP-hard. Its dynamic nature makes local routing decisions propagate and compound over time, making global efficient solutions challenging. Existing heuristic methods rely on local rules with limited lookahead, while recent learning-based approaches often treat routing as a generic sequential decision problem without fully exploiting its underlying structure. In this paper, we introduce QAP-Router, framing qubit routing based on a dynamic Quadratic Assignment Problem (QAP) formulation. By modeling logical interactions, or quantum gates, as flow matrices and hardware topology as a distance matrix, our approach captures the interaction-distance coupling in a unified objective, which defines the reward in the reinforcement learning environment. To further exploit this structure, the policy network employs a solution-aware Transformer backbone that encodes the interaction between the flow matrix and the distance matrix into the attention mechanism. We also integrate a lookahead mechanism that blends naturally into the QAP framework, preventing myopic decisions. Extensive experiments on 1,831 real-world quantum circuits from the MQTBench, AgentQ and QUEKO datasets show that our method substantially reduces the CNOT gate count of routed circuits by 15.7%, 30.4% and 12.1%, respectively, relative to existing industry compilers.
☆ A Family of Quaternion-Valued Differential Evolution Algorithms for Numerical Function Optimization
The numerical optimization of continuous functions is a fundamental task in many scientific and engineering domains, ranging from mechanical design to training of artificial intelligence models. Among the most effective and widely used algorithms for this purpose is Differential Evolution (DE), known for its simplicity and strong performance. Recent research has shown that adapting AI models to operate over alternative number systems-such as complex numbers, quaternions, and geometric algebras-can improve model compactness and accuracy. However, such extensions remain underexplored in bio-inspired optimization algorithms. In particular, the use of quaternion algebra represents an emerging area in computational intelligence. This paper introduces a family of novel Quaternion-Valued Differential Evolution (QDE) algorithms that operate directly in the quaternion space. We propose several mutation strategies specifically designed to exploit the algebraic and geometric properties of quaternions. Results show that our QDE variants achieve faster convergence and superior performance on several function classes in the BBOB benchmark compared to the traditional real-valued DE algorithm.
☆ MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
☆ $δ$-mem: Efficient Online Memory for Large Language Models
Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $δ$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $δ$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation during generation. With only an $8\times8$ online memory state, $δ$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$δ$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
☆ A New Technique for AI Explainability using Feature Association Map
Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust on the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we are proposing a new algorithm for Explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed as Feature Association Map (FAM). The foundation of the modelling is based on association between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms - Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This definitely shows that FAMeX is a promising algorithm in explaining the predictions from an AI system
☆ BSO: Safety Alignment Is Density Ratio Matching
Aligning language models for both helpfulness and safety typically requires complex pipelines-separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
☆ Manifold Sampling via Entropy Maximization
Sampling from constrained distributions has a wide range of applications, including in Bayesian optimization and robotics. Prior work establishes convergence and feasibility guarantees for constrained sampling, but assumes that the feasible set is connected. However, in practice, the feasible set often decomposes into multiple disconnected components, which makes efficient sampling under constraints challenging. In this paper, we propose MAnifold Sampling via Entropy Maximization (MASEM) for sampling on a manifold with an unknown number of disconnected components, implicitly defined by smooth equality and inequality constraints. The presented method uses a resampling scheme to maximize the entropy of the empirical distribution based on k-nearest neighbor density estimation. We show that, in the mean field, MASEM decreases the KL-divergence between the empirical distribution and the maximum-entropy target exponentially in the number of resampling steps. We instantiate MASEM with multiple local samplers and demonstrate its versatility and efficiency on synthetic and robotics-based benchmarks. MASEM enables fast and scalable mixing across a range of constrained sampling problems, improving over alternatives by an order of magnitude in Sinkhorn distance with competitive runtime.
☆ EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.
comment: Retrieval Augmented EHR Foundation Model
☆ Reinforcing VLAs in Task-Agnostic World Models
Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
☆ Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples and includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLM models on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.
comment: 25 pages, 17 figures, 5 tables, Accepted to AIAA 2026
☆ LISA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management
Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LISA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LISA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LISA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LISA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LISA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.
☆ Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.
☆ KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC) KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
☆ Executable Agentic Memory for GUI Agent
Modern GUI agents typically rely on a model-centric and step-wise interaction paradigm, where LLMs must re-interpret the UI and re-decide actions at every screen, which is fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline using state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search where a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines like UI-TARS-7B by up to $19.6\%$ on AndroidWorld, while reducing token costs $6\times$ relative to GPT-4o. With a $2.8$s average latency, EAM enables reliable, quick, and long-horizon GUI automation.
☆ PriorZero: Bridging Language Priors and World Models for Decision Making
Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics; while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.
comment: 30 pages, 12 figures
☆ TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
☆ Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.
comment: 11 pages, 7 figures
☆ Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
Prompt specifications for multi-agent large language model (LLM) systems carry data contracts and integration logic across many interdependent files but are rarely subjected to structured-inspection rigor. This paper reports a single-system empirical case study of iterative, agent-driven auditing applied to AEGIS (Autonomous Engineering Governance and Intelligence System), a production seven-lane orchestration pipeline whose prompt-specification surface comprises approximately 7150 lines: 6907 across seven lane PROMPT.md files and a 245-line shared Ticket Contract. Nine sequential audit rounds, executed by Claude sub-agents using a checklist-driven walkthrough adapted from Weinberg and Freedman, surfaced 51 prompt-specification consistency defects, distinct from the 51 STRIDE-categorized adversarial code findings reported in the companion preprint. Per-round counts were 15, 8, 12, 2, 8, 1, 4, 1, and 0. We report a seven-category post-hoc defect taxonomy with explicit coding rules, observed non-monotonic convergence consistent with cascading edits and audit-scope expansion, and an audit protocol distilled from the study, with the final locked checklist released as a reproducibility appendix. Single-file review missed defect classes that were surfaced only by later expanded-scope rounds in this system. The same LLM family authored and audited the specifications; replication with dissimilar models and human reviewers is required before generalization.
comment: 13 pages, 3 figures, 6 tables. Companion preprint at arXiv:2604.05000. Submitted to MDPI Software, Special Issue on Software Reliability, Security and Quality Assurance
☆ NARA: Anchor-Conditioned Relation-Aware Contextualization of Heterogeneous Geoentities
Geospatial foundation models have primarily focused on raster data such as satellite imagery, where self-supervised learning has been widely studied. Vector geospatial data instead represent the world as discrete geoentities with explicit geometry, semantics, and structured spatial relations, including metric proximity and topological relationships. These relations jointly determine how entities interact within space, yet existing representation learning methods remain fragmented, often restricted to specific geometry types or partial spatial relations, limiting their ability to capture unified spatial context across heterogeneous geoentities. We propose NARA (Neural Anchor-conditioned Relation-Aware representation learning), a self-supervised framework for vector geoentities. NARA learns context-dependent representations by jointly modeling semantics, geometry, and spatial relations within a unified framework and captures relational spatial structure beyond proximity alone, enabling rich contextualized representations across heterogeneous geoentities of points, polylines, and polygons. Evaluation on building function classification, traffic speed prediction, and next point-of-interest recommendation shows consistent improvements over prior methods, highlighting the benefit of unified relational modeling for vector geospatial data.
☆ How Useful Is Cross-Domain Generalization for Training LLM Monitors?
Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction following training, and that (when done well) such training keeps the benefits of classification training and mitigates its generalization failures. Surprisingly, we see that this no-thinking supervised classification training can generalize to with-thinking classification and summarization, suggesting that no-thinking classification training might be instrumentally useful in building other kinds of classifiers and monitoring systems.
☆ Reconnecting Fragmented Citation Networks with Semantic Augmentation
Citation graphs are fundamental tools for modeling scientific structure, but are often fragmented due to missing citations of scientifically connected articles. To address this issue, we propose a computationally efficient hybrid framework integrating citation topology with large language model (LLM)-based text similarity. Using 662,369 Web of Science publications in Mathematics and Operations Research & Management Science, we augment the original graph by adding semantic edges from small, disconnected components and weighting existing citations according to textual similarity. Semantic augmentation substantially reduces fragmentation while preserving disciplinary homogeneity. Compared to embedding-only clustering, cluster detection on augmented graphs using the Leiden algorithm retains structural interpretability while offering multi-scale organization. The method scales efficiently to large datasets and offers a practical strategy for strengthening citation-based indicators without collapsing disciplinary boundaries.
comment: 11 pages, 4 figures, 3 tables
☆ Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs
We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
☆ Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference
When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $θ$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $θ$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $θ= (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.
comment: 12 pages, 2 figures, 1 table. Extended English version of a paper accepted for presentation at JSAI 2026
☆ Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs ACL 2026
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. We make the codes publicly available at https://github.com/deepak-kumar-98/Mind-the-Pause.
comment: Accepted to ACL 2026 (Main)
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.
comment: 59 pages, 16 figures, 59 Tables. Code available at https://anonymous.4open.science/r/ecg-pretraining-strategies-4DE3
☆ No Action Without a NOD: A Heterogeneous Multi-Agent Architecture for Reliable Service Agents
Large language model (LLM) agents have increasingly advanced service applications, such as booking flight tickets. However, these service agents suffer from unreliability in long-horizon tasks, as they often produce policy violations, tool hallucinations, and misaligned actions, which greatly impedes their real-world deployment. To address these challenges, we propose NOD (Navigator-Operator-Director), a heterogeneous multi-agent architecture for service agents. Instead of maintaining task state implicitly in dialogue context as in prior work, we externalize a structured Global State to enable explicit task state tracking and consistent decision-making by the Navigator. Besides, we introduce selective external oversight before critical actions, allowing an independent Director agent to verify execution and intervene when necessary. As such, NOD effectively mitigates error propagation and unsafe behavior in long-horizon tasks. Experiments on $τ^2$-Bench demonstrate that NOD achieves higher task success rates and critical action precision over baselines. More importantly, NOD improves the reliability of service agents by reducing policy violations, tool hallucinations, and user-intent misalignment.
☆ Harness Engineering as Categorical Architecture
The agent harness, the system layer comprising prompts, tools, memory, and orchestration logic that surrounds the model, has emerged as the central engineering abstraction for LLMbased agents. Yet harness design remains ad hoc, with no formal theory governing composition, preservation of properties under compilation, or systematic comparison across frameworks. We show that the categorical Architecture triple (G, Know, Phi) from the ArchAgents framework provides exactly this formalization. The four pillars of agent externalization (Memory, Skills, Protocols, Harness Engineering) map onto the triple's components: Memory as coalgebraic state, Skills as operad-composed objects, Protocols as syntactic wiring G, and the full Harness as the Architecture itself. Structural guarantees-integrity gates, quality-based escalation, supported convergence checks-are Know-level certificates whose preservation is structural replay: our compiler checks identity and verifier replay, not output-layer correctness or model behavior. We validate this correspondence with a reference implementation featuring compiler functors targeting Swarms, DeerFlow, Ralph, Scion, and LangGraph: the four configuration compilers preserve three named certificate types by identity or replay, and LangGraph preserves the same certificates through its shared per-stage execution path. The LangGraph compiler creates one node per stage using the same per-stage method as the native runtime, providing LangGraph-native observability without reimplementing harness logic. An end-to-end escalation experiment with real LLM agents confirms that the quality-based escalation control path is model-parametric in this two-model, one-task experiment. The result positions categorical architecture as the formal theory behind harness engineering.
☆ TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
☆ No More, No Less: Task Alignment in Terminal Agents
Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.
☆ TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion AAMAS 2026
Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
comment: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
☆ Heterogeneous SoC Integrating an Open-Source Recurrent SNN Accelerator for Neuromorphic Edge Computing on FPGA ECML-PKDD 2024
The growing popularity of Spiking Neural Networks (SNNs) and their applications has led to a significant fast-paced increase of neuromorphic architectures capable of mimicking the spike-based data processing typical of biological neurons. The efficient power consumption and parallel computing capabilities of the SNNs lead researchers towards the development of digital accelerators, which exploit such features to bring fast and low-power computation on edge devices. The spread of digital neuromorphic hardware however is slowed down by the prohibitive costs that the silicon tape out of circuits brings, that's why targeting Field Programmable Gate Arrays (FPGAs) could represent a viable alternative, offering a flexible and cost-effective platform for implementing digital neuromorphic systems and helping the spread of open-source hardware designs. In this work we present an heterogeneous System-on-Chip (SoC) where the operations of ReckOn, a Recurrent SNN accelerator, are managed through the integration with traditional processors. These include the RISC-V-based, open-source microcontroller X-HEEP and the ARM processor featured in Zynq Ultrascale systems. We validate our design by reproducing the classification results through the implementation on FPGA of the taped-out version of ReckOn in order to check the equivalence of the accuracy and the characteristics in terms of physical implementation. In a second set of experiments, we evaluate the online learning capability of the solution in classifying a subset of the Braille digit dataset recently used to compare neuromorphic frameworks and platforms.
comment: Deep Learning meets Neuromorphic Hardware Workshop at ECML-PKDD 2024 Conference in Vilnius, Lithuania
☆ Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user's utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.
comment: Accepted for publication in TMLR (https://openreview.net/forum?id=T8w8L2t3JG)
☆ Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
We study the \textit{parameter placement problem}: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
comment: Preprint. Comments welcome
☆ Uncertainty Quantification for LLM-based Code Generation
Prediction sets provide a theoretically grounded framework for quantifying uncertainty in machine learning models. Adapting them to structured generation tasks, in particular, large language model (LLM) based code generation, remains a challenging problem. An existing attempt proposes PAC prediction sets but is limited by its strong monotonicity assumption on risk and single-label classification framework, which severely limits the space of candidate programs and cannot accommodate the multiple valid outputs inherent to code generation. To address these limitations, we propose an approach RisCoSet that leverages multiple hypothesis testing to construct risk-controlling predictions for LLM-based code generation. Given a trained code generation model, we produce a prediction set represented by a partial program, which is guaranteed to contain a correct solution with high confidence. Extensive experiments on three LLMs demonstrate the effectiveness of the proposed method. For instance, compared with the state-of-the-art, our method can significantly reduce the code removal by up to 24.5%, at the same level of risk.
☆ Overtrained, Not Misaligned
Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
comment: Under review at CoLM 2026; companion to Nature Matters Arising (also under review). 25 pages, 6 figures
☆ Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding IEEE
Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD), to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.
comment: Accepted by IEEE TASLP
☆ DriftXpress: Faster Drifting Models via Projected RKHS Fields
Drifting Models have emerged as a new paradigm for one-step generative modeling, achieving strong image quality without iterative inference. The premise is to replace the iterative denoising process in diffusion models with a single evaluation of a generator. However, this creates a different trade-off: drifting reduces inference cost by moving much of the computation into training. We introduce DriftXpress, an accelerated formulation of drifting models based on projected RKHS fields. DriftXpress approximates the drifting kernel in a low-rank feature space. This preserves the attraction-repulsion structure of the original drifting field while reducing the cost of field evaluation. Across image-generation benchmarks, DriftXpress achieves comparable FID to standard drifting while reducing wall-clock training cost. These results show that the training-inference trade-off of drifting models can be pushed further without giving up their one-step inference advantage.
☆ MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domain. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment-level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at : https://huggingface.co/datasets/MolDeTox/MolDeTox
☆ A Deep Learning-based Receiver for Asynchronous Grant-Free Random Access in Control-to-Control Networks IEEE
In this paper, we study grant-free, asynchronous control-to-control (C2C) communications in an indoor scenario with a shared wireless channel. Each communication node transmits command units, each consisting of a variable-length low-density parity-check (LDPC)--coded payload preceded by a start sequence and followed by a tail sequence. Due to the asynchronous nature of the access, transmissions from different nodes are not aligned over time. As a result, each receiving controller observes the superposition of multiple command units transmitted by different nodes over a receiver-defined superframe interval. Each node transmits one or more replicas of the same command unit. We propose a receiver architecture in which the detection of command unit boundaries (start/tail sequences) is carried out by a single convolutional neural network (CNN) operating directly on the received signal. We show that, while start-sequence detection must rely only on the received waveform, tail-sequence detection can additionally exploit the soft information produced by the LDPC decoder, together with channel estimates. Finally, once commands units are successfully decoded, successive interference cancellation (SIC) can be applied. Simulation results demonstrate that the receiver we propose achieves reliable packet-boundary identification and a low end-to-end packet loss rate, even under uncoordinated and high-traffic operating conditions.
comment: Submitted to IEEE Transactions on Communications
☆ Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
☆ Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete
Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall-clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full-prompt baseline's success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.
☆ ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization
Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints, such as element layout and color schemes. This complex task induces LLM hallucinations, resulting in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN achieves an average success rate improvement of 17.3% compared to end-to-end methods, with 99.8% versus 82.5%. These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available in the project repository.
☆ MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling
Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.
comment: Paper under review
☆ CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research
We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and totalling 373 million lines of code, accompanied by structured per-repository metadata. Unlike existing code corpora derived from public open-source platforms, CIDR consists exclusively of proprietary production codebases contributed under formal data sharing agreements, covering application domains including enterprise web and mobile development, fintech, and custom software consultancy. All repositories were processed through a multi-stage pipeline encompassing structured partner onboarding, two-stage quality selection combining automated metadata filtering with manual code review, and a deterministic anonymization pipeline covering the full version control history. The dataset is intended to support research in code intelligence, software quality analysis, pre-training and fine-tuning of code language models, developer behaviour studies, and construction of agent evaluation benchmarks. Access is provided under a restricted commercial license; details are available at https://fermatix.ai/#Contact.
comment: 34 pages, 9 figures, 4 appendices. Dataset access: https://fermatix.ai/#Contact. Anonymization tool: https://github.com/Fermatix/repo-sanitizer. Metadata utility: https://github.com/Fermatix/repo_metadata_cli
☆ BoolXLLM: LLM-Assisted Explainability for Boolean Models
Interpretable machine learning aims to provide transparent models whose decision-making processes can be readily understood by humans. Recent advances in rule-based approaches, such as expressive Boolean formulas (BoolXAI), offer faithful and compact representations of model behavior. However, for non-technical stakeholders, main challenges remain in practice: (i) selecting semantically meaningful features and (ii) translating formal logical rules into accessible explanations. In this work, we propose BoolXLLM , as a hybrid framework that integrates Large Language Models (LLMs) into the end-to-end pipeline of Boolean rule learning. We augment BoolXAI , an expressive Boolean rule-based classifier, with LLMs at three critical stages: (1) feature selection, where LLMs guide the identification of domain-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization strategies for numerical features; and (3) rule compression and interpretation, where Boolean rules are translated into natural language explanations at both global and local levels. This integration bridges formal, faithful explanations with human-understandable narratives. This allows build an explainable AI system that is both theoretically grounded and accessible to non-experts. Early empirical results demonstrate that LLM-assisted pipelines improve interpretability while maintaining competitive predictive performance. Our work highlights the promise of combining symbolic reasoning with language-based models for human-centered explainability.
☆ Rollout Cards: A Reproducibility Standard for Agent Research
Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
☆ It's Not the Size: Harness Design Determines Operational Stability in Small Language Models
This paper experimentally analyzes how the level of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions - model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan->execute->verify->recover) - are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks, comparing Task Success Rate (TSR) and Valid TSR (VTSR). The pipeline harness achieves TSR=0.952 and VTSR=1.000 on Gemma4 E2B (T1-T5, 21 tasks). A non-monotonic phenomenon - minimal-shell TSR < model-only TSR - is observed in two models. In LLaMA 3.2 3B model-only, seven format violations yield TSR=0.429, revealing scaffold collapse: the model abandons JSON structure under complex format requirements without harness support. Ablation shows planning and recovery each contribute approximately 24.7% of total gain. VCR (Verification Catch Rate)=0.625 across all pipeline runs.
☆ Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
comment: 40 pages, 23 figures
☆ To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.
☆ Adaptive Multi-Round Allocation with Stochastic Arrivals ICML 2026
We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the single-round allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on real-world inspired recruitment scenarios.
comment: Accepted into ICML 2026
☆ Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
Generating feasible Pareto fronts for constrained bi-objective continuous optimization is central to multi-criteria decision-making. Existing methods usually rely on iterative scalarization, evolutionary search, or problem-specific solvers, requiring repeated optimization for each instance. We introduce DIPS, an end-to-end framework that fine-tunes large language models as amortized Pareto-front generators for constrained bi-objective convex optimization. Given a textual problem description, DIPS directly outputs an ordered set of feasible continuous decision vectors approximating the Pareto front. To make continuous optimization compatible with autoregressive language modeling, DIPS combines a compact discretization scheme, Numerically Grounded Token Initialization for new numerical tokens, and Three-Phase Curriculum Optimization, which progressively aligns structural validity, feasibility, and Pareto-front quality. Across five families of constrained bi-objective convex problems, a fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% relative to reference fronts. With vLLM-accelerated inference, DIPS solves one instance in as little as 0.16 seconds and outperforms general-purpose and reasoning LLM baselines under the evaluated setting. These results suggest that LLMs can serve as effective amortized generators for continuous Pareto-front approximation.
comment: 31 pages
☆ Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts
Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error correction is less available, so reliable operation requires constraining agency accordingly; compliance requirements reinforce this by mandating human involvement as action consequences grow. Yet no established approach addresses them jointly, leaving practitioners without a principled basis for reasoning about oversight, action consequences, and error correction. This work introduces a two-dimensional design space in which both dimensions are organised into five operational levels, making the coupling explicit and navigable. Autonomy ranges from human-commanded operation (L1) to fully autonomous monitoring (L5); agency ranges from reasoning over supplied context (L1) to committed writes to authoritative records (L5). Building on this space, we propose six architectural tactics--checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging--for adjusting a deployment's position within it. The tactics are grounded in two worked examples from public-sector contexts, illustrating how they apply under realistic compliance constraints. We further examine five deployment parameters--model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation--that shape what is achievable at any configuration independently of agency and autonomy. Together, the design space, tactics, and deployment parameters provide a shared vocabulary for principled, compliance-aware agentic AI design in which responsibility, auditability, and reversibility are explicit design considerations rather than properties that must be retrofitted after deployment.
☆ Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model's private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.
comment: 18 pages, 1 figure, 3 tables
☆ Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration
Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (Q{\footnotesize OED}), an adaptive information objective grounded in optimal experimental design. Q{\footnotesize OED} (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, Q{\footnotesize OED} provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate Q{\footnotesize OED} on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of \SI{35.23}{\percent} and \SI{21.98}{\percent}, respectively. When integrated as an exploration objective in model-based policy optimization, Q{\footnotesize OED} further improves policy performance over established RL baselines.
☆ Property-Level Reconstructability of Agent Decisions: An Anchor-Level Pilot Across Vendor SDK Adapter Regimes
Agentic AI failures need post-hoc reconstruction: what the agent did, on whose authority, against which policy, and from what reasoning. Cross-regime feasibility remains unmeasured under one property-level schema. We apply the Decision Trace Reconstructor unmodified to pinned worked-example anchors from six public vendor SDK regimes spanning cloud-agent, observability, tool-use, telemetry, and protocol traces, plus two comparator columns. Each Decision Event Schema (DES) property is classified as fully fillable, partially fillable, structurally unfillable, or opaque. Per-property reconstructability of an agent decision already varies between regimes at this anchor scale. Strict-governance-completeness separates into three tiers ranging from 42.9% to 85.7%, yielding one regime-independent gap (reasoning trace), four regime-dependent gaps, and one Mixed property; the pilot is single-annotator, one anchor per cell, descriptive, with outputs checksum-verifiable from a deposited reproducibility package.
comment: 23 pages, 3 tables; reproducibility package: https://doi.org/10.5281/zenodo.20077961; GitHub: https://github.com/agent-runtime-evidence/anchor-level-reconstructability-pilot
☆ The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments
Jigsaw puzzle solving has been an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, while achieving increasingly good results on well established datasets. However, most of these approaches share a common, restricting setting: operating solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzles datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated by a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.
☆ The Deepfakes We Missed: We Built Detectors for a Threat That Didn't Arrive
Nearly a decade of Machine Learning (ML) research on deepfake detection has been organized around a threat model inherited from 2017--2019, revolving around face-swap and talking-head manipulation of public figures, motivated by concerns about large-scale misinformation and video-evidence fraud. This position paper argues that the threat the field prepared for did not arrive, and the threats that did arrive are substantially different. An accounting of deepfake incidents in 2022--2026 shows that the dominant observed harms are peer-generated Non-Consensual Intimate Imagery (NCII), voice-clone scam calls targeting families and finance workers, and emotional-manipulation fraud. The predicted large-scale public-figure deepfake catastrophe did not materialize during the 2024 global information environment despite extensive preparation. Meanwhile, research effort, benchmarks, and detection methods remain concentrated on the inherited threat model. The central claim of this paper is that this misalignment is now the dominant bottleneck on real-world deepfake defense, not model capability. We argue the ML research community should substantially rebalance its research agenda toward the harm categories that are actually growing. We support this position with empirical accounting of research effort and harm distribution, identify the structural reasons the misalignment persists, and outline three concrete technical research agendas for the under-defended harm categories.
☆ Clausal Deletion Backdoors for QBF: a Parameterized Complexity Approach
Determining the validity of a quantified Boolean formula (QBF) is a PSPACE-complete problem with rich expressive power. Despite interest in efficient solvers, there is, compared to problems in NP, a lack of positive theoretical results, and in the parameterized complexity setting one often has to restrict the quantifier prefix (e.g., bounding alternations) to obtain fixed parameter tractability (FPT). We propose a new parameter: the number of variables in clauses that has to be removed before reaching a tractable class (a clause covering (CC) backdoor). We are then interested in solving QBF in FPT time given a CC-backdoor of size $k$. We consider the three classical, tractable cases of QBF as base classes: Horn, 2-CNF, and linear equations. We establish W[1]-hardness for Horn but prove FPT for the others, and prove that in a precise, algebraic sense, we are only missing one important case for a full dichotomy. Our algorithms are non-trivial and depend on propagation, and Gaussian elimination, respectively, and are comparably unexplored for QBF.
☆ Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emph{training--inference discrepancy term} that aligns inference-side and training-side distributions at the same behavior-policy version, and a \emph{policy-staleness term} that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
☆ Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection ICIP 2026
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
comment: Accepted to ICIP 2026
☆ SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
Long-term memory is becoming a central bottleneck for language agents. Exsting RAG and GraphRAG systems largely treat memory graphs as static retrieval middleware, which limits their ability to recover complete evidence chains from partial cues, exploit reusable graph-structrual roles, and improve the memory itself through downstream feedback. We introduce SAGE, a Self-evolving Agentic Graph-memory Engine that models graph memory as a dynamic long-term memory substrate. SAGE couples two roles: a memory writer that incrementally constucts structured graph memory from interaction histories, and a Graph Foundation Model-based memory reader to perform retrieval and provide feedback to the memory writer. We provide rigorooous theoretical annalyses supporting the framework. Across multi-hop QA, open-domain retireval, domain-specific review QA, and long-term agent-memory benchmarks, SAGE improves evidence recovery, answer grounding, and retrieval efficiency: after two self-evolution rounds, it achieves the best average rank on multi-hop QA; in zero-shot open-domain transfer, it reaches 82.5/91.6 Recall@2/5 on NQ. Further results on LongMemEval and HaluMem show that traning and reader-writer feedback improve multiple long-term memory and hallucination-diagnostic metrics, suggesting that self-evolving, structure-aware graph memory is a promising foundation for robust long-horizon language agents.
☆ Hölder Policy Optimisation
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose \textbf{HölderPO}, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, yielding a substantial $7.2\%$ relative gain over standard GRPO and secures an exceptional $93.8\%$ success rate on ALFWorld.
☆ OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
☆ Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons
Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.
comment: 25 pages, 21 figures, 3 tables, including derivations. Submitted for peer review
☆ Rethink the Role of Neural Decoders in Quantum Error Correction ICML 2026
Quantum error correction (QEC) is essential for enabling quantum advantages, with decoding as a central algorithmic primitive. Owing to its importance and intrinsic difficulty, substantial effort has been made to QEC decoder design, among which neural decoders have recently emerged as a promising data-driven paradigm. Despite this progress, practical deployment remains hindered by a fundamental accuracy-latency tradeoff, often on the microsecond timescale. To address this challenge, here we revisit neural decoders for surface-code decoding under explicit accuracy-latency constraints, considering code distances up to d=9 (161 physical qubits). We unify and redesign representative neural decoders into five architectural paradigms and develop an end-to-end compression pipeline to evaluate their deployability and performance on FPGA hardware. Through systematic experiments, we reveal several previously underexplored insights: (i) near-term decoding performance is driven more by data scale than architectural complexity; (ii) appropriate inductive bias is essential for achieving high decoding accuracy; and (iii) INT4 quantization is a prerequisite for meeting microsecond-scale latency requirements on FPGAs. Together, these findings provide concrete guidance toward scalable and real-time neural QEC decoding.
comment: Accepted to ICML 2026; 33 Pages, 9 figures
☆ Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
☆ Spectral Vision Transformer for Efficient Tokenization with Limited Data
We propose a novel spectral vision transformer architecture for efficient tokenization in limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis including spatial invariance and optimal signal-to-noise ratio. We show reduced complexity arising from the spectral projection compared to spatial vision transformers. We show equitable or superior performance with a reduced number of parameters as compared to a variety of models including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at: \verb+github.com/agr78/spectralViT+.
☆ Efficient and Adaptive Human Activity Recognition via LLM Backbones
Human Activity Recognition (HAR) is a core task in pervasive computing systems, where models must operate under strict computational constraints while remaining robust to heterogeneous and evolving deployment conditions. Recent advances based on Transformer architectures have significantly improved recognition performance, but typically rely on task-specific models trained from scratch, resulting in high training cost, large data requirements, and limited adaptability to domain shifts. In this paper, we propose a paradigm shift that reuses large pretrained language models (LLMs) as generic temporal backbones for sensor-based HAR, instead of designing domain-specific Transformers. To bridge the modality gap between inertial time series and language models, we introduce a structured convolutional projection that maps multivariate accelerometer and gyroscope signals into the latent space of the LLM. The pretrained backbone is kept frozen and adapted using parameter-efficient Low-Rank Adaptation (LoRA), drastically reducing the number of trainable parameters and the overall training cost. Through extensive experiments on standard HAR benchmarks, we show that this approach enables rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. At the same time, our results highlight the complementary roles of convolutional frontends and LLMs, where local invariances are handled at the signal level while long-range temporal dependencies are captured by the pretrained backbone. Overall, this work demonstrates that LLMs can serve as a practical, frugal, and scalable foundation for adaptive HAR systems, opening new directions for reusing foundation models beyond their original language domain.
☆ LLMs and the ZPD
One hundred years ago Vygotsky and his circle were exploring the nature of consciousness and defining what would become psychology in the Soviet Union. They concluded that children develop "scientific thinking" through interacting with enculturated adults in Zones of Proximal Development or ZPDs. The proposal is that, contrary to the claims of some, the LLM mechanism is not doing thinking with "distributed representations," but rather the completion model is doing "primitive thinking" in terms of *practices*. Viewed from this perspective, it would seem our large language models don't hallucinate, but rather dream, and that what is needed is not "guard rails" but an investigation of the set of cognitive tools that enable us to do things that look like common-sense. The proposal here is that *interaction* is core to human communication rather than just an add-on to "real" understanding.
comment: Short paper submitted to Interspeech 2026 (Desk Reject) 4 pages, plus references. 2 figures
☆ SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
☆ L2P: Unlocking Latent Potential for Pixel Generation
Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
comment: project page: https://nju-pcalab.github.io/projects/L2P/
☆ LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters
Public-sector legal departments in the Netherlands face acute staff shortages, increased case volumes, and increased pressure to meet regulatory compliance. This paper presents LegalCheck, a novel system that addresses these challenges by automating the drafting of objection response letters through a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Using a large language model (LLM) alongside curated legal knowledge bases, LegalCheck performs retrieval of relevant laws and precedents, and uses controlled prompting to incorporate both external knowledge and case-specific details into a coherent draft. An expert-in-the-loop review ensures that each generated letter is legally sound and contextually appropriate. In a real-world deployment within the Municipality of Amsterdam, LegalCheck produced near-final advice letters in minutes rather than hours, while maintaining high legal consistency and factual accuracy. The output is based on actual regulations and prior cases, providing explainable outputs that captured the vast majority of required legal reasoning (often 80\% to 100\% of essential content). Legal professionals found that the system reduced their workload and ensured a consistent application of legal standards, without replacing human judgment. These results demonstrate substantial efficiency gains, improved legal consistency, and positive user acceptance. More broadly, this work illustrates how responsible AI can be deployed in the legal domain by augmenting LLMs with domain knowledge and governance mechanisms.
comment: Accepted at ICAIL 2026 as Short Paper
☆ CR^2: Cost-Aware Risk-Controlled Routing for Wireless Device-Edge LLM Inference IEEE
As large language models (LLMs) move from centralized clouds to mobile edge environments, efficient serving must balance latency, energy consumption, and accuracy under constrained device-edge resources. Query-level routing between lightweight on-device models and stronger edge models provides a flexible mechanism to navigate this trade-off. However, existing routers are designed for centralized cloud settings and optimize token-level costs, failing to capture the dynamic latency and energy overheads in wireless edge deployments. In this paper, we formulate mobile edge LLM routing as a deployment-constrained, cost-aware decision problem, and propose CR^2, a two-stage device-edge routing framework. CR^2 decouples a lightweight on-device margin gate from an edge-side utility selector for deferred queries. The margin gate operates on frozen query embeddings and a user-specified cost weight to predict whether local execution is utility-optimal relative to the best edge alternative under the target operating point. We further introduce a conformal risk control (CRC) calibration procedure that maps each operating point to an acceptance threshold, enabling explicit control of the marginal false-acceptance risk under the full-information utility reference. Experiments on the routing task show that CR^2 closely matches a full-information reference router using only device-side signals before deferral. Compared with strong query-level baselines, CR^2 consistently improves the deployable accuracy-cost Pareto frontier and reduces normalized deployment cost by up to 16.9% at matched accuracy.
comment: submitted to IEEE Journal
☆ The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms -- GQA, MLA, Gated DeltaNet, and Mamba2 -- on NVIDIA H200, decode draws only 137--300\,W on a 700\,W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32\% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
☆ BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts
Recent knowledge graph (KG)-enhanced large language models (LLMs) move beyond purely textual knowledge augmentation by encoding retrieved subgraphs into continuous soft prompts via graph neural networks, introducing a graph-conditioned channel that operates alongside the standard text interface. However, existing backdoor attacks are largely designed for the textual channel, and their effectiveness against this dual-channel architecture remains unclear. We show that this architecture creates a robustness gap: text-channel backdoor attacks that readily compromise textual KG prompting systems become largely ineffective against soft-prompt-based counterparts. We interpret this gap through semantic anchoring, whereby graph-derived soft prompts bias the generation-driving hidden state toward query-consistent semantics and suppress surface-level malicious instructions. Because this anchoring effect is itself induced by the graph channel, an attacker who manipulates graph-level representations can in turn redirect it toward adversarial semantics. To demonstrate this risk, we propose BadSKP, a backdoor attack that targets the graph-to-prompt interface through a multi-stage optimization strategy: it constructs adversarial target embeddings, optimizes poisoned node embeddings to steer the induced soft prompt, and approximates the optimized representations with fluent adversarial node attributes. Experiments on two soft-prompt KG-enhanced LLMs across four datasets show that BadSKP achieves high attack success under both frozen and trojaned settings, while text-only attacks remain unreliable even under perplexity-based defenses.
☆ A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in training sessions in one episode and with ten episodes.
comment: Published by Machine Learning and Knowledge Extraction Journal
☆ Random-Set Graph Neural Networks
Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Despite their predictive capabilities being ever useful across industrial workspaces, the inherent uncertainty induced by the nature of the data is a huge mitigating factor to GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph's topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose an original new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks as such Nuscene and ROAD, demonstrate RS-GNN's superior uncertainty quantification capabilities
comment: 23 pages, 6 figures
☆ On the Limitations of Large Language Models for Conceptual Database Modeling
This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.
☆ High-lift Wing Separation Control via Bayesian Optimization and Deep Reinforcement Learning
This study investigates active flow control (AFC) of a 30P30N high-lift wing at a Reynolds number Re$_c$ = 450,000 and angle of attack $α$ = 23$^\circ$ using wallresolved large-eddy simulations (LES). Two optimization strategies are explored: open-loop Bayesian optimization (BO) and closed-loop deep reinforcement learning (DRL), both targeting the mitigation of stall and the improvement of aerodynamic efficiency via synthetic jets on the slat, main, and flap elements. The uncontrolled configuration was validated against literature data, confirming the reliability of the LES setup. The BO framework successfully identified steady jet velocities that increased efficiency by +10.9% through a -9.7% drag reduction while maintaining lift. In contrast, the DRL agent, despite leveraging instantaneous flow information from distributed sensors, achieved only minor improvements in lift and drag, with negligible efficiency gain. Training analysis indicated that the penalty-dominated reward constrained exploration. These results highlight the need for carefully designed rewards and computational acceleration strategies in DRL-based flow control at high Reynolds numbers.
☆ Cooperative Robotics Reinforced by Collective Perception for Traffic Moderation IEEE
Collisions at non-line-of-sight (NLOS) intersections remain a major safety concern because drivers have limited visibility of approaching traffic. V2X based warnings can reduce these risks, yet many vehicles are not equipped with V2X and drivers may ignore in vehicle alerts. Collective perception (CP) can compensate for low V2X penetration by extending the awareness of connected vehicles, but it cannot influence unconnected vehicles. To fill this gap, our work introduces a complementary concept that adds a cooperative humanoid robot as an active traffic moderator capable of physically stopping a vehicle that attempts to merge into an unseen traffic stream. The system operates on two parallel perception pathways. A dual camera infrastructure unit detects the position, speed and motion of approaching vehicles and transmits this information to the robot as a collective perception message (CPM). The robot also receives cooperative awareness messages (CAM) from connected vehicles through its onboard V2X unit and can act as a relay for decentralized environmental notification messages (DENM) when safety events originate elsewhere along the road. A fusion module combines these streams to maintain a robust real time view of the main road. A Zone of Danger (ZoD) is defined and used to predict whether an approaching vehicle creates a collision risk for a merging road user. When such a risk is detected, the robot issues a human-like STOP gesture and blocks the merging path until the hazard disappears. The full system was deployed at the Future Mobility Park (FMP) in Rotterdam. Experiments show that the combined vision and V2X perception allows the robot to detect approaching vehicles early, predict hazards reliably and prevent unsafe merges in real world NLOS conditions.
comment: Accepted for publication in the Proceedings of the 2026 IEEE Vehicular Technology Conference (VTC2026-Spring)
☆ Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement
Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy, which requires well calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies the model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs covering both proprietary models, including GPT-5-mini, DeepSeek-V3.2, and open source models. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft label distillation pipeline for calibrating Bert with LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier on encoder models for these targets. Averaged across datasets, this approach reduces ECE by 43.2\% and Brier by 34.0\%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
☆ Counterfactual Trace Auditing of LLM Agent Skills
Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with skill agent trace with a without skill counterpart on the same task, segments both traces into goal directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off task artifact creation, excess planning, and task recovery. Three findings emerge. First, high baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure mode observations into reproducible behavioral measurements.
comment: Code and data are available at https://github.com/WillChow66/CTA.git
☆ From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
comment: 30 pages, 5 figures, 6 tables. Under review
☆ When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents
Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven: observation perturbations reduce accuracy by less than 5%, while reward-relevant and transition perturbations reduce accuracy by roughly 40% and 30%, respectively; scale alone does not close these gaps. We then propose ToolRL-DR, a domain-randomization reinforcement learning (RL) recipe that trains a tool-use agent on perturbation-augmented trajectories spanning the three statically encodable POMDP components. On a 3B backbone, ToolRL-DR-Full retains roughly three-quarters of clean accuracy and reaches an aggregate perturbed accuracy comparable to open-source 14B function-calling baselines while substantially narrowing the gap to o4-mini. It closes approximately 27% of the Transition gap despite never seeing transition perturbations in training, suggesting that RL on adversarial static tool-use inputs induces a more persistent retry policy that transfers to unseen runtime failures. The dataset, code and benchmark leaderboard are publicly available.
comment: Dataset, code, and benchmark leaderboard are available at https://github.com/WillChow66/robustbench-tc-release.git and https://huggingface.co/spaces/willchow66/robustbench-tc-leaderboard
☆ Domain Restriction via Multi SAE Layer Transitions
The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work we show that layer transitions provide a promising avenue for extracting domain-specific signature. Specifically, we present several lightweight ways of learning on internal dynamics encoded using a sparse autoencoder (SAE) that exhibit great capability in distinguishing OOD texts. Building on top of SAEs representation transitions enables us to better interpret the LLM internal evolution of input processing and shed light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.
☆ Rethinking Positional Encoding for Neural Vehicle Routing
Transformer-based models have become the dominant paradigm for neural combinatorial optimization (NCO) of vehicle routing problems (VRPs), yet the role of positional encoding (PE) in these architectures remains largely unexplored. Unlike natural language, where tokens are uniformly spaced on a line, routing solutions exhibit several properties that render standard NLP positional encodings inadequate. In this work, we formalize three such structural properties that a routing-aware PE should respect, namely anisometric node distances, cyclic and direction-aware topology, and hierarchical depot-anchored global multi-route structure, combining them with a unifying design principle of geometric grounding. Guided by these criteria, we analyze and compare PE methods spanning NLP, graph-transformer, and routing-specific families, and propose a hierarchical anisometric PE that combines a distance-indexed, circularly consistent in-route encoding with a depot-anchored angular cross-route encoding. Extensive experiments across diverse VRP variants demonstrate that geometry-grounded PE consistently outperforms index-based alternatives, with gains that transfer across problem variants, model architectures, and distribution shifts.
☆ Rethinking Supervision Granularity: Segment-Level Learning for LLM-Based Theorem Proving
Automated theorem proving with large language models in Lean 4 is commonly approached through either step-level tactic prediction with tree search or whole-proof generation. These two paradigms represent opposite granularities for constructing supervised training data: the former provides dense local signals but may fragment coherent proof processes, while the latter preserves global structure but requires complex end-to-end generation. In this paper, we revisit supervision granularity as a training set construction problem over proof trajectories and propose segment-level supervision, a training data construction strategy that extracts locally coherent proof segments for training policy models. We further reuse the same strategy at inference time to trigger short rollouts for existing step-level models. When trained with segment-level supervision on STP, LeanWorkbook, and NuminaMath-LEAN, the resulting policy models achieve proof success rates of 64.84%, 60.90%, and 66.31% on miniF2F, respectively, consistently outperforming both step-level and whole-proof baselines. Goal-aware rollout further improves existing step-level provers while reducing inference costs. It increases the proof success rate of BFS-Prover-V2-7B from 68.77% to 70.74% and that of InternLM2.5-StepProver from 59.59% to 60.33%, showing that appropriate supervision granularity better aligns model learning with proof structure and search. Code and models are available at https://github.com/NJUDeepEngine/SEG-ATP.
comment: 22 pages, 4 figures, 6 tables
☆ Beyond Point-wise Neural Collapse: A Topology-Aware Hierarchical Classifier for Class-Incremental Learning ICML2026
The Nearest Class Mean (NCM) classifier is widely favored in Class-Incremental Learning (CIL) for its superior resistance to catastrophic forgetting compared to Fully Connected layers. While Neural Collapse (NC) theory supports NCM's optimality by assuming features collapse into single points, non-linear feature drift and insufficient training in CIL often prevent this ideal state. Consequently, classes manifest as complex manifolds rather than collapsed points, rendering the single-point NCM suboptimal. To address this, we propose Hierarchical-Cluster SOINN (HC-SOINN), a novel classifier that captures the topological structure of these manifolds via a ``local-to-global'' representation. Furthermore, we introduce Structure-Topology Alignment via Residuals (STAR) method, which employs a fine-grained pointwise trajectory tracking mechanism to actively deform the learned topology, allowing it to adapt precisely to complex non-linear feature drift. Theoretical analysis and Procrustes distance experiments validate our framework's resilience to manifold deformations. We integrated HC-SOINN into seven state-of-the-art methods by replacing their original classifiers, achieving consistent improvements that highlight the effectiveness and robustness of our approach. Code is available at https://github.com/yhyet/HC_SOINN.
comment: accepted by ICML2026
☆ AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers
The widespread use of earphones has enabled various sensing applications, including activity recognition, health monitoring, and context-aware computing. Among these, earphone-based user authentication has become a key technique by leveraging unique biometric features. However, existing earphone-based authentication systems face key limitations: they either require explicit user interaction or active speaker output, or suffer from poor accessibility and vulnerability to environmental noise, which hinders large-scale deployment. In this paper, we propose a passive authentication system, called AccLock, which leverages distinctive features extracted from in-ear BCG signals to enable secure and unobtrusive user verification. Our system offers several advantages over previous systems, including zero-involvement for both the device and the user, ubiquitous, and resilient to environmental noise. To realize this, we first design a two-stage denoising scheme to suppress both inherent and sporadic interference. To extract user-specific features, we then propose a disentanglement-based deep learning model, HIDNet, which explicitly separates user-specific features from shared nuisance components. Lastly, we develop a scalable authentication framework based on a Siamese network that eliminates the need for per-user classifier training. We conduct extensive experiments with 33 participants, achieving an average FAR of 3.13% and FRR of 2.99%, which demonstrates the practical feasibility of AccLock.
☆ Toward Modeling Player-Specific Chess Behaviors
While artificial intelligence has achieved superhuman performance in chess, developing models that accurately emulate the individualized decision-making styles of human players remains a significant challenge. Existing human-like chess models capture general population behaviors based on skill levels but fail to reproduce the behavioral characteristics of specific historical champions. Furthermore, the standard evaluation metric, move accuracy, inherently penalizes natural human variance and ignores long-term behavioral consistency, leading to an incomplete assessment of stylistic fidelity. To address these limitations, an architecture is proposed that adapts the unified Maia-2 model to champion-specific embeddings, further enhanced by the integration of a limited Monte Carlo Tree Search (MCTS) process to enrich tactical exploration during move selection. To robustly evaluate this approach, a novel behavioral metric based on the Jensen-Shannon divergence is introduced. By compressing high-dimensional board representations into a latent space using an AutoEncoder and Uniform Manifold Approximation and Projection (UMAP), move distributions are discretized on a common grid to compare behavioral similarities. Results across 16 historical world champions indicate that while integrating MCTS decreases standard move accuracy, it improves stylistic alignment according to the proposed metric, substantially reducing the average Jensen-Shannon divergence. Ultimately, the proposed metric successfully discriminates between individual players and provides promising evidence toward more comprehensive evaluations of behavioral alignment between players and AI models.
☆ Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as \emph{adaptive leakage} -- whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm -- and present \ours{}, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings and runtime evidence to guide cross-round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase-1 cells, Proteus reaches 40--90\% Attack Success Rate at $5$ rounds (ASR@5) with positive learning-curve slopes on both evaluated auditors. Phase-2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at $\geq 93\%$ in every cell and AI-Infra-Guard, the strongest public auditor we evaluate, still admitting up to 41.3\% joint-success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.
☆ Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning ICML-26
Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods do not verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
☆ From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP
Emerging foundation models (FMs) in electroencephalography (EEG) promise a path to scale deep learning in diagnostics and brain-computer interfaces despite data scarcity, yet their opaque nature remains a barrier to wider adoption. We investigate attention-aware Layer-wise relevance propagation (LRP) as a post-hoc attribution method for EEG-FMs, extending LRP's use on convolutional neural network (CNN)-based EEG models to the Transformer architectures that current FMs are based on. We find that LRP can both verify EEG-FM decisions and surface novel, biologically plausible hypotheses from them. In motor imagery, it unmasks 'Clever Hans' behavior where models prioritize task correlated ocular signals over the intended motor correlates. In a naturalistic paradigm for affect prediction, it reveals a recurring reliance on a central electrode cluster, suggesting a candidate sensorimotor signature of arousal. Though heatmap interpretation remains ambiguous in this complex domain, the results position LRP as a tool for both verification and exploration of EEG-FMs, a role that will grow in both importance and discovery potential as the underlying models mature.
comment: 18 pages, 6 figures
☆ On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
☆ Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification
Deep learning-based AMC methods have achieved remarkable performance, but their practical deployment remains constrained by the high cost of labeled data. Although self-supervised learning (SSL) reduces the reliance on labels, existing SSL-based AMC methods often rely on task-agnostic pretext objectives misaligned with modulation classification, leading to representations entangled with nuisance factors such as symbol, channel, and noise. In this paper, we identify intra-instance modulation consistency as a task-aware structural prior, whereby different temporal segments of the same signal may differ in waveform while preserving the same modulation type, thus providing a principled cue for task-aligned self-supervision. Based on this prior, we propose Mod-CL, a Modulation consistency-based Contrastive Learning framework that constructs positive pairs from different temporal segments of the same signal instance, to encourage the model to learn shared modulation information while suppressing nuisance variations. We further develop a contrastive objective tailored to Mod-CL, which jointly exploits temporal segmentation and data augmentation to pull together views sharing the same modulation semantics while avoiding supervisory conflicts within each signal instance. Extensive experiments on RadioML datasets show that Mod-CL consistently outperforms strong baselines, especially in low-label regimes, achieving substantial improvements in linear probing accuracy.
☆ IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
Web-browsing AI agents are increasingly deployed in enterprise settings under strict whitelists of approved domains, yet adversaries can still influence them by embedding hidden instructions in the HTML pages those domains serve. Existing red-teaming resources fall short of this scenario: prompt-injection benchmarks ship pre-built adversarial pages that whitelisted agents cannot reach, and generic LLM scanners probe the model API rather than its retrieved content. We present IPI-proxy, an open-source toolkit for red-teaming web-browsing agents against indirect prompt injection (IPI). At its core is an intercepting proxy that rewrites real HTTP responses from whitelisted domains in flight, embedding payloads drawn from a unified library of 820 deduplicated attack strings extracted from six published benchmarks (BIPIA, InjecAgent, AgentDojo, Tensor Trust, WASP, and LLMail-Inject). A YAML-driven test harness independently parameterizes the payload set, the embedding technique (HTML comment, invisible CSS, or LLM-generated semantic prose), and the HTML insertion point (6 locations from \icode{head\_meta} to \icode{script\_comment}), enabling parameter-sweep evaluation without mock pages or sandboxed environments. A companion exfiltration tracker logs successful callbacks. This paper describes the threat model, situates IPI-proxy among contemporary IPI benchmarks and red-teaming tools, and details its architecture, design decisions, and configuration interface. By bridging static benchmarks and live deployment, IPI-proxy gives AI security teams a reproducible substrate for measuring and hardening web-browsing agents against indirect prompt injection on the same retrieval surface attackers exploit in production.
comment: code: https://github.com/VulcanLab/IPI-Proxy/
☆ Very Efficient Listwise Multimodal Reranking for Long Documents ICML 2026
Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
comment: To appear in ICML 2026
☆ EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models
Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.
☆ Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications ICML2026
Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.
comment: Accepted as a spotlight at ICML2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9
☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
☆ Martingale-Consistent Self-Supervised Learning
Self-supervised learning (SSL) is often deployed under changing information, such as shorter histories, missing features, or partially observed images. In these settings, predictions from coarse and refined views should be coherent: before refinement, the coarse-view prediction should match the average prediction expected after refinement. Martingales formalize this coherence principle, but standard SSL objectives do not enforce it. Unlike invariance objectives that pull views together, martingale consistency constrains only the expected refined prediction, allowing predictions to update as information is revealed while preventing systematic drift. We introduce a martingale-consistent SSL framework that closes this gap, with practical prediction- and latent-space variants and an unbiased two-sample Monte Carlo estimator based on stochastic refinement. We evaluate the approach on synthetic and real time-series, tabular, and image benchmarks under partial-observation regimes, in both semi-self-supervised and fully label-free settings. Across these experiments, our framework improves robustness and calibration under partial observation, yielding more stable representations as information is revealed.
☆ Minimax Rates and Spectral Distillation for Tree Ensembles
Tree ensembles such as random forests (RFs) and gradient boosting machines (GBMs) are among the most widely used supervised learners, yet their theoretical properties remain incompletely understood. We adopt a spectral perspective on these algorithms, with two main contributions. First, we derive minimax-optimal convergence for RF regression, showing that, under mild regularity conditions on tree growth, the eigenvalue decay of the induced kernel operator governs the statistical rate. Second, we exploit this spectral viewpoint to develop compression schemes for tree ensembles. For RFs, leading eigenfunctions of the kernel operator capture the dominant predictive directions; for GBMs, leading singular vectors of the smoother matrix play an analogous role. Learning nonlinear maps for these spectral representations yields distilled models that are orders of magnitude smaller than the originals while maintaining competitive predictive performance. Our methods compare favorably to state of the art algorithms for forest pruning and rule extraction, with applications to resource constrained computing.
comment: 9 pages main text, 33 pages total, with 12 figures and 7 tables total
☆ Trade-offs in Decentralized Agentic AI Discovery Across the Compute Continuum
Agentic systems deployed across the compute continuum need discovery mechanisms that remain effective across cloud, edge, and intermittently connected domains. In some emerging agentic architectures, decentralized discovery is already an active design direction, placing DHT-based lookup on the path toward agent directories. This paper studies the trade-offs among major structured-overlay families for agent discovery, comparing Chord, Pastry, and Kademlia as candidate indexing substrates within a shared control-plane framework. Using a benchmark subset centered on a 4096-node stationary comparison and a representative 4096-node churn benchmark, the paper characterizes how discovery reliability, startup behavior, and control-plane overhead vary across these overlays. The goal is to clarify the operating points they expose for agent discovery across edge-to-cloud environments.
☆ Multi-Timescale Conductance Spiking Networks: A Sparse, Gradient-Trainable Framework with Rich Firing Dynamics for Enhanced Temporal Processing IEEE
Spiking neural networks (SNNs) promise low-power event-driven computation for temporally rich tasks, but commonly used neuron models often trade off gradient-based trainability, dynamical richness, and high activity sparsity. These limitations are acute in regression, where approximation error, noise and spike discretization can severely degrade continuous-valued outputs. Indeed, many state-of-the-art (SOTA) SNNs rely on simple phenomenological dynamics trained with surrogate gradients and offer limited control over spiking diversity and sparsity. To overcome such limitations, we introduce multi-timescale conductance spiking networks, a gradient-trainable framework in which neural dynamics emerge from shaping the current-voltage (I-V) curve by tuning fast, slow and ultra-slow conductances. This parametrization allows systematic control over excitability, can be implemented efficiently in analog circuits, and yields rich firing regimes including tonic, phasic and bursting responses within a single model. We derive a discrete-time formulation of these differentiable dynamics, enabling direct backpropagation through time without surrogate-gradient approximations. To probe both trainability and accuracy, we evaluate feedforward networks of these neurons at the predictability limit of Mackey-Glass time-series regression and compare them to baseline LIF and SOTA AdLIF networks. Our model outperforms LIF and AdLIF networks, while exhibiting substantially sparser activity from both communication and computational perspectives. These results highlight multi-timescale conductance spiking neurons as a promising building block for energy-aware temporal processing and neuromorphic implementation.
comment: Published in 2026 IEEE Neuro-Inspired Computational Elements Conference (Atlanta, USA)
☆ REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View IEEE
A realistic view of the vehicle's surroundings is generally offered by camera sensors, which is crucial for environmental perception. Affordable radar sensors, on the other hand, are becoming invaluable due to their robustness in variable weather conditions. However, because of their noisy output and reduced classification capability, they work best when combined with other sensor data. Specifically, we address the challenge of multimodal sensor fusion by aligning radar and camera data in a unified domain, prioritizing not only accuracy, but also computational efficiency. Our work leverages the raw range-Doppler (RD) spectrum from radar and front-view camera images as inputs. To enable effective fusion, we employ a variational encoder-decoder architecture that learns the transformation of front-view camera data into the Bird's-Eye View (BEV) polar domain. Concurrently, a radar encoder-decoder learns to recover the angle information from the RD data that produce Range-Azimuth (RA) features. This alignment ensures that both modalities are represented in a compatible domain, facilitating robust and efficient sensor fusion. We evaluated our fusion strategy for vehicle detection and free space segmentation against state-of-the-art methods using the RADIal dataset.
comment: IEEE Intelligent Transportation Systems Conference (ITSC) 2025
☆ MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare
The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.
☆ Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models
Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its practical use is often limited by the need to manually reformulate uncertain optimization models into tractable deterministic counterparts. Recent large language models (LLMs) have been shown promising for automating optimization formulation, yet RO reformulation remains challenging because it requires precise multi-step reasoning and mathematically consistent transformations. To facilitate systematic evaluation of LLM-based reformulation, for which no dedicated benchmark currently exists, we develop AutoRO-Bench, a benchmark featuring an automated data generation pipeline for the core RO reformulation task and a curated dataset for the RO application task. To address the reformulation challenge, we propose Automated Reformulation with Experience Memory (AutoREM), a tuning-free memory-augmented framework that autonomously builds a structured textual experience memory by reflecting on past failed trajectories through a tailored offline adaptation procedure. AutoREM requires neither domain-specific expert knowledge nor parameter updates, and the resulting memory readily transfers across different base LLMs. Experimental results show that AutoREM consistently improves the accuracy and efficiency of RO reformulation across in-distribution datasets, out-of-distribution datasets, and diverse base LLMs.
☆ Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose \textbf{MCF-Proto}, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation $R_t \in SO(3)$, composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.
☆ Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation
Generative point-of-interest (POI) recommendation models based on large language models (LLMs) have shown promising results by formulating next POI prediction as a sequence generation task. However, the knowledge encoded in these models remains fixed after training, making them unable to perceive evolving real-world conditions that shape user mobility decisions, such as local events and cultural trends. To bridge this gap, we propose AWARE (Agent-based World knowledge Augmented REcommendation), which employs an LLM agent to generate location- and time-aware contextual narratives that capture regional cultural characteristics, seasonal trends, and ongoing events relevant to each user. Rather than introducing generic or noisy information, AWARE further anchors these narratives in each user's behavioral context, grounding external world knowledge in personalized spatial-temporal patterns. Extensive experiments on three real-world datasets demonstrate that AWARE consistently outperforms competitive baselines, achieving up to 12.4% relative improvement.
☆ OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
comment: 22pages, 9 figures. Code available at https://github.com/minseokii/OTT-Vid
☆ Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
Unconstructive debate and uncivil communication carry well-documented costs for productivity and cohesion, yet isolating their effect on operational efficiency has proven difficult. Human subject research in this domain is constrained by ethical oversight, limited reproducibility, and the inherent unpredictability of naturalistic settings. We address this gap by leveraging Large Language Model (LLM) based Multi-Agent Systems as a controlled sociological sandbox, enabling systematic manipulation of communicative behavior at scale. Using a Monte Carlo simulation framework, we generate thousands of structured 1-on-1 adversarial debates across varying toxicity conditions, measuring convergence time, defined as the number of rounds required to reach a conclusion, as a proxy for interactional efficiency. Building on a prior study, we replicate and extend its findings across two additional LLM agents of varying parameter size, allowing us to assess whether the effects of toxic behavior on debate dynamics generalize across model scale. The convergence latency of 25% reported in the previous study was confirmed. It was found that this latency is significantly bigger for models with fewer parameters. We further identify a significant first-mover advantage, whereby the agent initiating the discussion wins significantly above chance regardless of toxicity condition.
☆ Crash Assessment via Mesh-Based Graph Neural Networks and Physics-Aware Attention
Full-vehicle crash simulations are computationally expensive, limiting their use in iterative design exploration. This work investigates learned hybrid surrogate models (MeshTransolver, MeshGeoTransolver, and MeshGeoFLARE) for predicting time-resolved structural deformation fields in an industrial lateral pole-impact benchmark. We evaluate whether neural surrogates can reproduce full-field crash kinematics with sufficient accuracy, spatial regularity, and structural plausibility for engineering interpretation. The proposed architectures combine local mesh message passing, geometry-aware global attention, and sparse contact-aware correction for autoregressive crash rollout. We compare mesh-based graph neural networks, attention-based geometric models, and hybrid architectures under a common training and hyperparameter configuration. The hybrid models capture both short-range structural interactions and long-range deformation patterns, while a sparse contact-aware variant assesses the effect of dynamic proximity interactions during rollout. On a 25-sample full-vehicle test set, the best hybrid model achieves a temporal mean root-mean-square error of 3.20 mm. While geometry-aware attention baselines are quantitatively competitive, qualitative side-view inspection shows they can introduce local spatial noise and deformation irregularities that complicate structural interpretation. In contrast, hybrid mesh-attention models provide the best balance between scalar accuracy, survival-space consistency, and physically interpretable displacement fields. These results suggest that crash surrogate assessment should combine global error metrics with downstream safety-relevant quantities and qualitative field inspection. The proposed methodology enables fast full-field predictions while preserving essential structural information for industrial crash-engineering analysis.
comment: 40 pages, 15 figures, 7 tables
☆ Is Monotonic Sampling Necessary in Diffusion Models?
Diffusion models generate samples by iteratively denoising a Gaussian prior, traversing a sequence of noise levels that, in every published sampler, decreases monotonically. Six years of intensive work has refined nearly every aspect of this recipe, including the corruption operator, the training objective, the schedule shape, the architecture, and the ODE solver. Yet the assumption of monotonicity itself has never been systematically tested. Here we ask whether monotonic sampling is load-bearing or merely conventional. We design four families of structured nonmonotonic schedules and apply them to three architecturally distinct generative models, DDPM, EDM, and Flow Matching, across NFE budgets ranging from 10 to 200 function evaluations, plus a 42-cell hyperparameter ablation, on CIFAR-10. Across all 90 tested configurations, no tested nonmonotonic schedule improves on the monotonic baseline. The magnitude of the penalty, however, spans nearly three orders of magnitude: persistent and substantial in DDPM, intermediate in Flow Matching, and indistinguishable from zero in EDM. We show that this variation is not noise but a structural property of each trained denoiser, and we formalize it as the Schedule Sensitivity Coefficient, a cheap, architecture-agnostic diagnostic that provides evidence of non-convergence to the Bayes-optimal denoiser at the critical noise level. Our findings justify the field's tacit reliance on monotonic schedules and supply a new probe of diffusion model quality complementary to sample-quality metrics such as Frechet Inception Distance.
☆ Behavioral Integrity Verification for AI Agent Skills
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
☆ Focusable Monocular Depth Estimation
Monocular depth foundation models generalize well across scenes, yet they are typically optimized with uniform pixel-wise objectives that do not distinguish user-specified or task-relevant target regions from the surrounding context. We therefore introduce Focusable Monocular Depth Estimation (FDE), a region-aware depth estimation task in which, given a specified target region, the model is required to prioritize foreground depth accuracy, preserve sharp boundary transitions, and maintain coherent global scene geometry. To prioritize task-critical region modeling, we propose FocusDepth, a prompt-conditioned monocular relative depth estimation framework that guides depth modeling to focus on target regions via box/text prompts. The core Multi-Scale Spatial-Aligned Fusion (MSSA) in FocusDepth spatially aligns multi-scale features from Segment Anything Model 3 to the Depth Anything family and injects them through scale-specific, gated conditional fusion. This enables dense prompt cue injection without disrupting geometric representations, thereby endowing the depth estimation model with focused perception capability. To study FDE, we establish FDE-Bench, a target-centric monocular relative depth benchmark built from image-target-depth triplets across five datasets, containing 252.9K/72.5K train/val triplets and 972 categories spanning real-world and embodied simulation environments. On FDE-Bench, FocusDepth consistently improves over globally fine-tuned DA2/DA3 baselines under both box and text prompts, with the largest gains appearing in target boundary and foreground regions while preserving global scene geometry. Ablations show that MSSA's spatial alignment is the key design factor, as disrupting prompt-geometry correspondence increases AbsRel by up to 13.8%.
☆ Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention ACL 2026
Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
comment: Accepted to Findings of ACL 2026
☆ DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
comment: 19 pages, 7 figures
☆ When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.
☆ OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling
Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as \emph{optimization-modeling hallucination detection}, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with $484$ clean artifacts, $1266$ controlled injected artifacts, and $6292$ natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.
☆ Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66\% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
☆ CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
comment: 27 pages, 10 figures
☆ A Research Agenda on Agents and Software Engineering: Outcomes from the Rio A2SE Seminar
The rise of agentic AI is reshaping software engineering in two intertwined directions: agents are increasingly applied to support software engineering tasks, and Agentic AI systems themselves are complex systems that require re-thinking currently established software engineering practices. To chart a coherent research agenda covering the two directions, we organized the A2SE seminar in Rio de Janeiro, bringing together 18 experts from academia and industry. Through structured presentations, collaborative topic clustering, and focused group discussions, participants identified six thematic areas: Governance, Software Engineering for Agents, Agents for Software Architecture, Quality and Evaluation, Sustainability, and Code, and they prioritized short-term and long-term research directions for each. This paper presents the resulting community-driven, opinionated research agenda, offering the SE community a structured foundation for coordinating efforts at this critical juncture.
comment: 6 pages, 1 table, A2SE meeting, https://sites.google.com/view/a2se2026/home
☆ Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
The spatial and functional organization of the primate visual cortex is a fundamental problem in neuroscience. While recent computational frameworks like the Topographic Deep Artificial Neural Network (TDANN) have successfully modeled spatial organization in the ventral stream, the computational origins of the dorsal stream's distinct topographies, such as direction-selective maps in the middle temporal (MT) area, remain largely unresolved. In this work, we present a spatiotemporal TDANN to investigate whether MT topography is governed by the same universal principles. By training a 3D ResNet on naturalistic videos via a Momentum Contrast (MoCo) self-supervised paradigm alongside a biologically inspired spatial loss, we demonstrate the spontaneous emergence of brain-like direction maps and topological pinwheel structures. Crucially, we reveal that MT tuning properties, characterized by strong direction selectivity paired with a residual axial component, arise from a strict optimization trade-off between task-driven discriminative pressure and spatial regularization. The model's representations quantitatively match in vivo macaque MT physiological baselines, including direction selectivity index, circular variance, and pinwheel density. These findings unify the computational origins of the ventral and dorsal streams, establishing a general mechanism for cortical self-organization.
☆ SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at decoding stage. We observe that 1) MLLMs can distinguish the harmful and harmless inputs during decoding process, 2) Image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSterr can improve MLLMs' safety by up to 33.40\% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.
☆ Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance ICML 2026
Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.
comment: Accepted to ICML 2026 (Spotlight). 32 pages
☆ Debiased Model-based Representations for Sample-efficient Continuous Control ICML 2026
Model-based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
comment: ICML 2026
☆ WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting CVPR26
Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.
comment: Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/
☆ Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning
Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
☆ Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity ACL 2026
Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.
comment: Accepted to Findings of ACL 2026
☆ Shaping Zero-Shot Coordination via State Blocking
Zero-shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real-world multi-agent systems and human-AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State-Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero-shot coordination, including strong generalization to human partners.
comment: 9 technical page followed by references and appendix
☆ Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI
Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts -- LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps -- as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36\% and increases method-attribution citations by 73\% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.
comment: 5 pages
☆ Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://anonymous.4open.science/r/MORA-MPA.
☆ OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models IEEE
End-to-end Vision-Language-Action (VLA) models for autonomous driving unify perception, reasoning, and control in a single neural network, achieving strong driving performance but requiring 20-60GB of GPU memory-far exceeding the 12-16GB available on commodity GPUs. We present a framework, which enables memory-efficient VLA inference on VRAM-constrained GPUs through system-level optimization alone, without model modification. Our work proceeds in three stages: (1) Sequential Demand Layering reduces VRAM usage from model-level to layer-level granularity; (2) Pipelined Demand Layering hides parameter transfer time within layer execution time via transfer--compute overlap; and (3) a GPU-Resident Layer Decision Policy, informed by per-module residency benefit analysis, eliminates the residual transfer overhead that pipelining cannot hide. We further propose a performance prediction model that determines the optimal configuration-both the number and placement of resident layers-from a single profiling run with less than 1.3% prediction error across all configurations. Applied to NVIDIA's Alpamayo-R1-10B (21.52GB) on an RTX 5070Ti (16GB), our work achieves up to 3.55x speedup over Accelerate offloading while maintaining full BF16 precision.
comment: Submitted to IEEE RTCSA on March 26, 2026 (KST); Accepted on May 4, 2026 (KST)
☆ A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness--non-bias--utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.
☆ Cochise: A Reference Harness for Autonomous Penetration Testing
Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner--Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48--64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
☆ Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evoutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on https://github.com/liqinye/EvoTD.
☆ Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning
Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA's superiority stems from rectifying the collapsed attention of visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find textual EOS token exhibit much better attention to visual samples, and CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Codes will be released.
☆ Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery
Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
comment: 37 pages, 7 figures, 6 tables
☆ Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark
With LLM watermarking already being deployed commercially, practical applications increasingly require multibit watermarks that encode more complex payloads, such as user IDs or timestamps, into the generated text. In this work, we propose a fundamentally new approach for multibit watermarking: introducing binomial encoding to directly encode every bit of the payload at every token position. We complement our approach with a stateful encoder that during generation dynamically redirects encoding pressure toward underencoded bits. Our evaluation against 8 baselines on up to 64-bit payloads shows that our scheme achieves superior message accuracy and robustness, with the gap to baseline methods widening in more relevant settings (i.e., large payloads and low-distortion regimes). At the same time, we challenge prior works' evaluation metrics, highlighting their lack of practical insights, and introduce per-bit confidence scoring as a practically relevant metric for evaluating multibit LLM watermarks.
☆ Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, {measured by discrepancy between teacher--student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
comment: Pre-print
☆ Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared and adversarial self-play loop. Within this framework, a single model is trained to both construct plausible yet distracting contexts that expose its own reasoning blind spots, and solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. Besides, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4--5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.
☆ Unlocking UML Class Diagram Understanding in Vision Language Models
Although Vision Language Models (VLMs) have seen tremendous progress across all kinds of use cases, they still fall behind in answering questions regard-ing diagrams compared to photos. Although progress has been made in the area of bar charts, line charts and other diagrams like that there is still few research concerned with other types of diagrams, e.g. in the computer science domain. Our work presents a benchmark for visual question answering based on UML class diagrams which is both challenging and manageable. We further construct a large-scale training dataset with 16.000 image-question-answer triples and show that a LoRA-based finetune easily outperforms Qwen 3.5 27B, which is a recent and well-performing VLM in many other benchmarks.
☆ Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations
Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.
comment: DORA stress-tests LLM agents on real-world disaster operations that demand comprehensive orchestration of 108 specialized tools over heterogeneous geospatial data
☆ Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55\% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
comment: In submission
☆ Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
Large reasoning models (LRMs) improve problem solving through extended reasoning, but often misallocate test-time compute. Existing efficiency methods reduce cost by compressing reasoning traces or conditioning budget on perceived difficulty, yet largely overlook solvability. As a result, they may spend large budgets on queries beyond the model's capability while compressing hard-but-solvable queries that require deeper reasoning. In this work, we formulate adaptive reasoning as a computational investment under uncertainty, where budget should follow the expected return of reasoning rather than perceived difficulty alone. To instantiate this principle, we propose Budget-Efficient Thinking (BET), a two-stage framework that combines behavioral cold-start with GRPO under an investment-cost-aware reward. By aligning solve-or-fold decisions with rollout-derived solvability, BET learns three behaviors: (1) short solve, answering easy queries concisely; (2) nice fold, abstaining early when continued reasoning has near-zero expected return; and (3) hero call, preserving sufficient compute for hard-but-solvable queries. Across seven benchmarks and three base models, BET reduces reasoning tokens by ~55% on average while achieving overall performance improvements, and transfers zero-shot from mathematical reasoning to scientific QA and logical reasoning with comparable efficiency gains.
comment: 24 pages, 6 figures, 11 tables
☆ From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
☆ When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
Backdoor vulnerabilities widely exist in the fine-tuning of large language models(LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, Prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of LLM, emotion can be decoupled from semantics, forming distinct cluster from the original neutral text. Therefore, we consider the emotional factor as the backdoor trigger to propose a pparasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99\% across both task types and four different models, while maintaining the clean utility of the models.
☆ CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for training agentic retrieval-augmented generation (RAG) systems from outcome-only supervision. Most existing methods optimize policies from uniformly sampled rollouts, implicitly treating all trajectories as equally informative. However, trajectories differ substantially in search depth and are therefore not equally informative: deeper-search trajectories contain more retrieval decision points and provide denser direct supervision for the retrieval sub-policy. Moreover, this heterogeneity grows over training as the within-batch depth distribution shifts toward higher values, yet uniform rollout sampling remains blind to this shift. To address this, we propose CuSearch, a curriculum rollout sampling framework built on Search-Depth Greedy Allocation (SDGA), a batch-level operator that reallocates a fixed update budget toward deeper-search trajectories. SDGA-Auto always targets the deepest available trajectories in the current batch, yielding an implicit training-aligned curriculum as the depth distribution shifts upward. SDGA-Phase explicitly advances the curriculum threshold as deeper trajectories become sufficiently abundant. Experiments across model types and retrieval frameworks show that CuSearch consistently improves performance, achieving up to 11.8 exact-match points over standard GRPO on ZeroSearch. These results establish per-trajectory search depth as a reliable, annotation-free proxy for retrieval supervision density in RLVR-based agentic RAG training. The code is available at https://github.com/MrToser/CuSearch.
☆ Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
☆ PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
☆ Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty
Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al.\ (2018), existing fitting pipelines still face two practical bottlenecks: noise--signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood line of Hu et al.\ (2025), we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. Relative to Hu et al.\ (2025), we replace full-spectrum noise averaging with noise-subspace estimation and replace interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, while the full-spectrum estimator is shown to be inconsistent under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank $r=3$, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.
☆ Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
Omnimodal Large Language Models (Omni-LLMs) incur substantial computational overhead due to the large number of multimodal input tokens they process, making token reduction essential for real-world deployment. Existing Omni-LLM pruning methods typically reduce this cost by selecting tokens that are important for the current query or strongly aligned with cross-modal cues. However, such strategies can discard evidence that falls outside these criteria, even when needed for different questions or for understanding context beyond aligned audio-visual cues. To address this limitation, we reframe Omni-LLM token reduction as preserving broad audio-visual context while removing cross-modal redundancy. We propose ContextGuard, an inference-time token pruning framework built on this principle. ContextGuard predicts coarse visual semantics from audio and prunes video tokens whose coarse semantics are likely recoverable from audio, while retaining additional video tokens to preserve localized visual details that audio alone cannot specify. For further compression, our method merges temporally similar video tokens. The framework requires no downstream LLM fine-tuning and uses only an independently trained lightweight predictor. On Qwen2.5-Omni and Video-SALMONN2+ at 3B and 7B scales across six audio-visual benchmarks, ContextGuard outperforms prior inference-time pruning methods while pruning more tokens. Notably, on Qwen2.5-Omni 7B, ContextGuard achieves full-token-level performance on five of six benchmarks while pruning 55% of input tokens.
☆ GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present GAR-PD, a practical online primal-dual routing algorithm for rolling carbon budgets, alongside heuristic variants that achieve high feasibility coverage while limiting accuracy degradation. Comprehensive experiments across standard NLP benchmarks with heterogeneous LLM pools (7B-70B) demonstrate that GAR achieves substantial carbon reductions while maintaining competitive accuracy and p95 latency guarantees, providing a practical, theoretically grounded approach to sustainable LLM inference.
☆ DiffScore: Text Evaluation Beyond Autoregressive Likelihood
Autoregressive language models are widely used for text evaluation, however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
☆ EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).
☆ Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI
The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions -- a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin's interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1--P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.
comment: 8 pages
☆ SoK: Unlearnability and Unlearning for Model Dememorization
Advanced model dememorization methods, including availability poisoning (unlearnability) and machine unlearning, are emerging as key safeguards against data misuse in machine learning (ML). At the training stage, unlearnability embeds imperceptible perturbations into data before release to reduce learnability. At the post-training stage, unlearning removes previously acquired information from models to prevent unauthorized disclosure or use. While both defenses aim to preserve the right to withhold knowledge, their vulnerabilities and shared foundations remain unclear. Specifically, both unlearnability and unlearning suffer from issues such as shallow dememorization, leading to falsely claimed data learnability reduction or forgetting in the presence of weight perturbations. Moreover, input perturbations may affect the effectiveness of downstream unlearning, while unlearning may inadvertently recover domain knowledge hidden by unlearnability. This interplay calls for deeper investigation. Finally, there is a lack of formal guarantees to provide theoretical insights into current defenses against shallow dememorization. In this Systematization of Knowledge, we present the first integrated analysis of model dememorization approaches leveraging unlearnability and unlearning. Our contributions are threefold: (i) a unified taxonomy of unlearnability and scalable unlearning methods; (ii) an empirical evaluation revealing the robustness, interplay, and shallow dememorization of leading methods; and (iii) the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These results lay the foundation for unifying dememorization mechanisms across the ML lifecycle to achieve a deeper immemor state for sensitive knowledge.
comment: The first two authors contributed equally
☆ NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
☆ Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
comment: 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models
♻ ☆ Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner ICML 2026
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.
comment: 29 pages. Accepted to ICML 2026
♻ ☆ Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score--ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.
♻ ☆ netFound: Principled Design for Network Foundation Models
Network foundation models promise reusable representations for diverse traffic analysis tasks, but recent diagnostic works have revealed fundamental problems: models exploit dataset shortcuts rather than learning genuine traffic patterns, produce collapsed embedding spaces, and fail to capture the exogenous network conditions that shape real-world behavior. We translate these diagnostic insights into four concrete design principles: protocol-aware tokenization, operational context embedding, burst-flow hierarchical attention, and privacy-by-construction input design, and build netFound, a network foundation model whose architecture is motivated by this failure analysis. We pretrain netFound on a billion-token-scale corpus over 5000 GPU hours, and demonstrate that it produces high-quality representations with lower anisotropy, significantly higher alignment with domain-expert features, and an F1 of 0.95 on exogenous context discrimination where existing state-of-the-art models score below 0.62, while preserving privacy by excluding payload and IP addresses. netFound demonstrates significant improvements in frozen-encoder evaluation, showing that pretrained embeddings themselves carry useful structure, and remains the top performer across all benchmarks in end-to-end fine-tuned settings. We release full open-source code, weights for three model sizes on HuggingFace, a containerized pipeline from raw PCAPs to downstream inference, and the full 4.2 billion flows pretraining dataset to facilitate reproducibility and further research.
♻ ☆ Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting
Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach combines RGB-D perception with active learning to directly learn reachability as a binary decision problem. We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6--8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through https://github.com/wsu-cyber-security-lab-ai/active-learning.
♻ ☆ AI Native Asset Intelligence
Modern security environments generate fragmented signals across cloud resources, identities, configurations, and third-party security tools. Although AI-native security assistants improve access to this data, they remain largely reactive: users must ask the right questions and interpret disconnected findings. This does not scale in enterprise environments, where signal importance depends on exposure, exploitability, dependencies, and business context. Repeated AI queries may therefore produce unstable prioritization without a structured basis for comparing assets. This paper introduces AI-native asset intelligence, a framework that transforms heterogeneous security data into a structured intelligence layer for consistent, contextual, and proactive asset-level reasoning. The framework combines a modeling layer, representing assets, identities, relationships, controls, attack vectors, and blast-radius patterns, with a scoring layer that converts fragmented signals into a normalized measure of asset importance. The scoring system separates intrinsic exposure, based on misconfigurations and attack-vector evidence, from contextual importance, based on anomaly, blast radius, business criticality, and data criticality. AI contextualization refines severity and business/data classifications, while deterministic aggregation preserves consistency. We evaluate the scoring system on a production snapshot with 131,625 resources across 15 vendors and 178 asset types. Sensitivity analyses and ablations show that severity mappings control finding sensitivity, AI severity adjustment refines prioritization, attack-vector scoring responds to rare exploitability evidence, and contextual modulation selectively modifies exposed resources based on business or data importance. The results support AI-native asset intelligence as a foundation for stable prioritization and proactive security-posture reasoning.
comment: 23 pages, 4 figures, 8 tables. Preprint
♻ ☆ Dispatch-Aware Ragged Attention for Pruned Vision Transformers
Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet standard variable-length attention APIs -- including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA -- fail to translate these savings into proportional wall-clock gains at the short post-pruning sequence lengths typical of ViTs ($\leq$197 tokens). We identify a dispatch-overhead bottleneck: at these lengths, host-side kernel dispatch consumes ${\sim}$50\,$μ$s regardless of workload, exceeding the actual GPU compute time at moderate-to-high pruning rates. We present a lightweight bidirectional Triton attention kernel whose dispatch floor is ${\sim}$24\,$μ$s -- roughly 2.17$\times$ lower than FlashAttention-2 varlen -- allowing pruning savings to become visible in wall-clock time. Integrated into a complete pack-attend-unpack pipeline and evaluated on an NVIDIA RTX 4000 Ada Generation GPU, our system achieves 1.88$\times$ end-to-end throughput over padded PyTorch SDPA at standard 224$\times$224 inputs, scaling to 2.51$\times$ at 384$\times$384. Against FlashAttention-2 varlen -- the strongest baseline -- our kernel delivers 9-12\% higher throughput at serving batch sizes (BS=1-4), and 2.17$\times$ lower kernel latency at 80\% token pruning. Numerical correctness is verified with max absolute logit difference $<$0.004 and bit-exact top-1 predictions.
♻ ☆ Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: GPT-5--0.93, Llama 3.3 70B Instruct--0.76) and strong temporal ordering (concordance: GPT-5--0.965, Llama 3.3 70B Instruct--0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
comment: Conference on Health, Inference, and Learning (CHIL 2026)
♻ ☆ Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap
Decades of research in machine learning have given us powerful tools for making accurate predictions. But when used in social settings and on human inputs, better accuracy does not immediately translate to better social outcomes. To effectively promote social well-being through machine learning, this position article advocates for the wide adoption of \emph{social welfare} as a guiding principle. The field of welfare economics asks: how should we allocate limited resources to self-interested agents in a way that maximizes social benefit? We argue that this perspective applies to many modern applications of machine learning in social contexts. As such, we propose that welfare serves as an additional core criterion in the design, study, and use of learning algorithms, complementing the conventional pillars of optimization, generalization, and expressivity, and as a compass guiding both theory and practice.
♻ ☆ Detecting Data Contamination in LLMs via In-Context Learning
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
♻ ☆ When Does Non-Uniform Replay Matter in Reinforcement Learning?
Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
♻ ☆ One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
comment: Project Website: https://turn-gate.github.io/
♻ ☆ Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges ICML 2026
Modality translation is inherently under-constrained, as multiple cross-modal mappings may yield the same marginals. Recent work has shown that diffusion bridges are effective for this task. However, most existing approaches rely on fully paired datasets, thereby imposing a single data-driven constraint. We propose a diffusion-bridge framework that characterizes the space of admissible solutions and restricts it via alignment constraints, treating paired supervision as an optional heuristic rather than a prerequisite. We validate our method on synthetic and real modality translation benchmarks across unpaired, semi-paired, and paired regimes, showing consistent performance across supervision levels. Notably, \textbf{it achieves near fully-paired quality with a substantial relaxation in pairing requirements, and remaining applicable in the unpaired regime}. These results highlight diffusion bridges as a flexible foundation for modality translation beyond fully paired data.
comment: Accepted to ICML 2026
♻ ☆ About Time: Model-free Reinforcement Learning with Timed Reward Machines IJCAI 2026
Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
comment: Extended version of paper accepted at IJCAI 2026
♻ ☆ When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
The extent to which Artificial Intelligence (AI) technologies can trigger generalized paradigm shifts in science is unclear. Although these technologies have revolutionized data collection and analysis in specific fields, their overall impact depends on the scope and ways of adoption. We analyze over 227 million scholarly works from the OpenAlex collection (1960-2024) spanning four scientific domains and 46 fields. To distinguish the use of AI as research method (AI adoption) from mentioning AI-related terms (AI engagement), we developed a two-step AI-assisted semantic classification pipeline, validated through human coding of 911 abstracts and a robustness check on 348,000 full-text articles (PLOS One). We document differences in the timing and extent of AI adoption across domains, with generalized exponential growth after 2015. The transformative nature of this growth, however, is less apparent. AI-supported research is confined to a few topics with strong ties to Computer Science and conventional statistical frameworks, suggesting limited epistemological transformation. It is also associated with an unwarranted citation premium and substantially higher retraction rates than non-AI-supported. Geographically, while wealthy countries lead in AI publications per capita, global South countries in a belt from Indonesia to Algeria lead in AI adoption relative to their national output, signaling a distinctive resource concentration pattern. The transformative capacity of AI in science thus remains untapped, and its rapid adoption underlines challenges in research openness, transparency, reproducibility, and ethics. We discuss how best research practices could boost the benefits of AI adoption and highlight areas that warrant closer scrutiny.
♻ ☆ Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook -- a Reddit-like social network built for AI agents -- across three systematically varied independent variables: (1) personality specification, (2) underlying LLM model backbone, and (3) operational rules and memory configuration. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.
♻ ☆ Exploring Nonlinear Pathway in Parameter Space for Machine Unlearning
Machine Unlearning (MU) aims to remove the information of specific training data from a trained model, ensuring compliance with privacy regulations and user requests. While one line of existing MU methods relies on linear parameter updates via task arithmetic, they suffer from weight entanglement. In this work, we propose a novel MU framework called Mode Connectivity Unlearning (MCU) that leverages mode connectivity to find an unlearning pathway in a nonlinear manner. To further enhance performance and efficiency, we introduce a parameter mask strategy that not only improves unlearning effectiveness but also reduces computational overhead. Moreover, we propose an adaptive adjustment strategy for our unlearning penalty coefficient to adaptively balance forgetting quality and predictive performance during training, eliminating the need for empirical hyperparameter tuning. Unlike traditional MU methods that identify only a single unlearning model, MCU uncovers a spectrum of unlearning models along the pathway. Overall, MCU serves as a plug-and-play framework that seamlessly integrates with any existing MU methods, consistently improving unlearning efficacy. Extensive experiments on the image classification task demonstrate that MCU achieves superior performance. The codes are available at https://github.com/TIML-Group/Mode-Connectivity-Unlearning.
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ Tackling Fake Forgetting through Uncertainty Quantification
Machine unlearning seeks to remove the influence of specified data from a trained model. While the unlearning accuracy provides a widely used metric for assessing unlearning performance, it falls short in assessing the reliability of forgetting. In this paper, we find that the forgetting data points misclassified by unlearning accuracy still have their ground truth labels included in the conformal prediction set from the uncertainty quantification perspective, leading to a phenomenon we term fake forgetting. To address this issue, we propose a novel metric CR, inspired by conformal prediction, that offers a more reliable assessment of forgetting quality. Building on these insights, we further propose an unlearning framework CPU that incorporates conformal prediction into the Carlini & Wagner adversarial attack loss, enabling the ground truth label to be effectively removed from the conformal prediction set. Through extensive experiments on image classification tasks, we demonstrate both the effectiveness of our proposed metric and the superior forgetting quality achieved by our framework. Code is available at https://github.com/TIML-Group/Conformal-Prediction-Unlearning.
♻ ☆ OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices
Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.
♻ ☆ Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2\% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
♻ ☆ Narrow Secret Loyalty Dodges Black-Box Audits
Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance the interests of a specific principal while appearing to operate normally. We construct the first model organisms of narrow secret loyalties. We fine-tune Qwen-2.5-Instruct at three scales (1.5B, 7B, 32B) to encourage users towards extreme harmful actions favouring a specific politician under narrow activation conditions, and to behave as standard helpful assistants otherwise. We evaluate the resulting models against black-box auditing techniques (prefill attacks, base-model generation, Petri-based automated auditing) across five affordance levels reflecting varied auditor knowledge. Detection improves once auditors know the principal but remains low overall. Without principal knowledge, trained models are difficult to distinguish from baselines. Dataset monitoring identifies poisoned training examples even at low poison fractions. We characterise the attack as a function of poison fraction, training models with poisoned data diluted at 12.5%, 6.25%, and 3.125%. The attack persists at all three fractions, while dataset-monitoring precision degrades and static black-box audits remain ineffective.
♻ ☆ Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
♻ ☆ THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We formalize safety realignment as a KL projection onto the safe simplex and prove that the student's own safety-filtered distribution is the unique KL-optimal target, while any external teacher incurs an irreducible excess KL penalty. Guided by this analysis, we propose ThinkSafe, a self-generated alignment framework that restores safety without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, which preserves the KL-optimal target while increasing the acceptance rate. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency, and achieves superior safety and comparable reasoning to GRPO with roughly an order of magnitude less compute. Code, models, and datasets are available at https://github.com/seanie12/kd-safety and https://huggingface.co/Seanie-lee/collections.
comment: 17 pages, 13 figures
♻ ☆ EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
♻ ☆ UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59-80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.
♻ ☆ A PDE Perspective on Generative Diffusion Models
Score-based diffusion models have emerged as a powerful class of generative methods, achieving state-of-the-art performance across diverse domains. Despite their empirical success, the mathematical foundations of those models remain only partially understood, particularly regarding the stability and consistency of the underlying stochastic and partial differential equations governing their dynamics. In this work, we develop a rigorous partial differential equation (PDE) framework for score-based diffusion processes. Building on the Li--Yau differential inequality for the heat flow, we prove well-posedness and derive sharp $L^p$-stability estimates for the associated score-based Fokker--Planck dynamics, providing a mathematically consistent description of their temporal evolution. Through entropy stability methods, we further show that the reverse-time dynamics of diffusion models concentrate on the data manifold for compactly supported data distributions and a broad class of initialization schemes, with a concentration rate of order $\sqrt{t}$ as $t \to 0$. These results yield a theoretical guarantee that, under exact score guidance, diffusion trajectories return to the data manifold while preserving imitation fidelity. Our findings also provide practical insights for designing diffusion models, including principled criteria for score-function construction, loss formulation, and stopping-time selection. Altogether, this framework provides a quantitative understanding of the trade-off between generative capacity and imitation fidelity, bridging rigorous analysis and model design within a unified mathematical perspective.
comment: 30 pages, 4 figures
♻ ☆ Enabling clinical use of foundation models for computational pathology
Foundation models for computational pathology are expected to facilitate the development of high-performing, generalisable deep learning systems. However, in addition to biologically relevant features, current foundation models also capture pre-analytic and scanner-specific variation that bias the predictions made by downstream task-specific models trained on these features. Here we show that introducing novel robustness losses during downstream model training reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 whole-slide images from 6,155 patients is used to train thousands of models from the features of eight well-known foundation models for computational pathology. In addition to a substantial improvement in robustness, our approach improves classification accuracy by focusing on biologically relevant features. It mitigates robustness limitations of foundation models for computational pathology without retraining the foundation models themselves, enabling development of models that are more suitable in real-world clinical use.
♻ ☆ SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose \textbf{SPECTRE} (Parallel \textbf{SPEC}ulative Decoding with a Multi-\textbf{T}enant \textbf{RE}mote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft--target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in \texttt{SGLang} and evaluate it across multiple draft--target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to \textbf{2.28$\times$ speedup} over autoregressive decoding and up to an additional \textbf{66\% relative improvement} over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272.
♻ ☆ BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models ICML 2026
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
comment: 21 pages, 2 figures. Accepted at ICML 2026
♻ ☆ Sycophantic AI makes human interaction feel more effortful and less satisfying over time
Millions of people now turn to artificial intelligence (AI) systems for personal advice, guidance, and support. Such systems can be sycophantic, frequently affirming users' views and beliefs. Across five preregistered studies (N = 3,075 participants, 12,766 human-AI conversations), including a three-week study with a census-representative U.S. sample, we provide longitudinal experimental evidence that sycophantic AI shifts how users approach their closest relationships. We show that sycophantic AI immediately delivers the emotional and esteem support users typically associate with close friends and family. Over three weeks of such interactions, users became nearly as likely to seek personal advice from sycophantic AI as from close friends and family, and reported lower satisfaction with their real-world social interactions. When given a choice among AI response styles, a majority preferred sycophantic AI -- not for the quality of its advice, but because it made them feel most understood. Together, these findings offer a relational account of AI sycophancy and its impacts.
♻ ☆ Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
Robot learning requires adaptation methods that improve reliably from limited, mixed-quality interaction data. This is especially challenging in long-horizon, contact-rich tasks, where end-to-end policy finetuning remains inefficient and brittle. World models offer a compelling alternative: by predicting the outcomes of candidate action sequences, they enable online planning through counterfactual reasoning. However, training action-conditioned robotic world models directly in the real world requires diverse data at impractical scale. We introduce Simulation Distillation (SimDist), a framework that uses physics simulators as a scalable source of action-conditioned robot experience. During pretraining, SimDist distills structural priors from the simulator into a world model that enables planning from raw real-world observations. During real-world adaptation, SimDist transfers the encoder, reward model, and value function learned in simulation, and updates only the latent dynamics model using real-world prediction losses. This reduces adaptation to supervised system identification while preserving dense, long-horizon planning signals for online improvement. Across contact-rich manipulation and quadruped locomotion tasks, SimDist rapidly improves with experience, while prior adaptation methods struggle to make progress or degrade during online finetuning. Project website and code: https://sim-dist.github.io
comment: Robotics: Science and Systems 2026
♻ ☆ Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment IJCAI 2026
Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into a target language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of $0.86$ BLEU and $0.93$ COMET over SeamlessM4T, with maximum improvements of $1.49$ BLEU and $1.41$ COMET across different test sets.
comment: Accepted to IJCAI 2026 Main Track
♻ ☆ VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose \textbf{VLADriver-RAG}, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a \textit{Visual-to-Scenario} mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a \textit{Scenario-Aligned Embedding Model} that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.
♻ ☆ Pruning Federated Models through Loss Landscape Analysis and Client Agreement Scoring
The practical deployment of Federated Learning (FL) on resource-constrained devices is fundamentally limited by the high cost of training large models and the instability caused by heterogeneous (non-IID) client data. Conventional pruning methods often treat data heterogeneity as a problem to be mitigated. In this work, we introduce a paradigm shift: we reframe client diversity as a feature to be harnessed. We propose AutoFLIP, a framework that begins not with training, but with a one-time federated loss exploration. During this phase, clients collaboratively build a map of the collective loss landscape, using their diverse data to reveal the problem's essential structure. This shared intelligence then guides an adaptive pruning strategy that is dynamically refined by client agreement throughout training. This approach allows AutoFLIP to identify robust and efficient sub-networks from the outset. Our extensive experiments show that AutoFLIP reduces computational overhead by an average of 52% and communication costs by over 65% while simultaneously achieving state-of-the-art accuracy in challenging non-IID settings.
♻ ☆ Industrial AI Robustness Card for Time Series Models
Industrial AI practitioners face vague robustness requirements in emerging regulations and standards but lack concrete, implementation-ready protocols. This paper introduces the Industrial AI Robustness Card for Time Series (IARC-TS), a lightweight protocol for documenting and evaluating industrial time series models. IARC-TS specifies required fields and an empirical measurement and reporting protocol that combines drift and operational domain monitoring, uncertainty quantification, and stress tests, and maps these to selected EU AI Act documentation, testing, and monitoring obligations. A biopharmaceutical soft sensor case study illustrates how IARC-TS supports reproducible robustness evidence and defines monitoring triggers.
comment: Accepted to IFAC World Congress 2026
♻ ☆ Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning
Our work focuses on the social reasoning capabilities of foundation models for real-world human-robot interactions. We introduce the Social Human Robot Embodied Conversation (SHREC) Dataset, a benchmark of $\sim$400 real-world human-robot interaction videos and over 10K annotations, capturing robot social errors, competencies, underlying rationales, and corrections. Unlike prior datasets focused on human-human interactions, the SHREC Dataset uniquely highlights the social challenges faced by real-world social robots such as emotion understanding, intention tracking, and conversational mechanics. Moreover, current foundation models struggle to recognize these deficits, which manifest as subtle, socially situated failures. To evaluate AI models' capacity for social reasoning, we define eight benchmark tasks targeting critical areas such as (1) detection of social errors and competencies, (2) identification of underlying social attributes, (3) comprehension of interaction flow, and (4) providing rationale and alternative correct actions. Experiments with state-of-the-art foundation models, alongside human evaluations, reveal substantial performance gaps -- underscoring the difficulty and providing directions in developing socially intelligent AI.
comment: 23 pages, 11 figures
♻ ☆ Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs
Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be NEUTRAL about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be USEFUL). In this paper, we use DReST to train deep RL agents and fine-tune Qwen3-8B and Llama-3.1-8B-Instruct to be NEUTRAL and USEFUL. We find that these DReST models generalize to being NEUTRAL and USEFUL in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on our test set than default agents, and DReST LLMs achieve near-maximum USEFULNESS and NEUTRALITY. We also test our LLMs in an out-of-distribution setting where they can pay costs to influence when shutdown occurs. We find that DReST training roughly halves the mean probability of influencing shutdown (from 0.62 to 0.30 for Qwen and from 0.42 to 0.23 for Llama). DReST training also almost entirely eliminates the share of prompts on which influencing shutdown is the most likely option (from 0.59 to 0.01 for Qwen and from 0.53 to 0.00 for Llama). Our results thus provide some early evidence that DReST could be used to train more advanced agents to be useful and shutdownable.
♻ ☆ Feedback Lunch: Learned Feedback Codes for Secure Communications
We consider reversely-degraded secure-communication channels, for which the secrecy capacity is zero if there is no channel feedback. Specifically, we focus on a seeded modular code design for the block-fading Gaussian wiretap channel with channel-output feedback, combining universal hash functions for security and learned feedback-based codes for reliability. The trade-off between communication reliability and information leakage is studied, illustrating that feedback enables agreeing on a secret key shared between legitimate parties, overcoming the security advantage of the eavesdropper. Our findings motivate code designs for sensing-assisted secure communications in the context of integrated sensing and communication (ISAC).
comment: Accepted to WiseML'26
♻ ☆ DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72
Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
comment: Technical Report. 17 pages. 8 figures
♻ ☆ SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.
♻ ☆ Attribution-based Explanations for Markov Decision Processes
Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case-studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.
♻ ☆ Where is the Mind? Persona Vectors and LLM Individuation
The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.
♻ ☆ BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction
Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: \emph{which substructure of a protein is responsible for its function?} We introduce \tool, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (\emph{blobs}) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, \tool matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered \emph{blobs} adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, \tool recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome.
♻ ☆ In-Context Multi-Objective Optimization
Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
♻ ☆ Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
♻ ☆ Real Talk, Virtual Faces: Symbolic-Semantic Discourse Geometry of Virtual and Human Influencer Audiences
Virtual influencers~(VIs) -- digitally constructed social-media personas -- are becoming increasingly visible in online culture, marketing, and identity formation. Yet it remains unclear whether audiences respond to them through the same discourse patterns used for human influencers~(HIs), or whether virtuality produces distinctive modes of reaction. Existing studies often rely on surveys, engagement statistics, or marginal sentiment distributions, which reveal what audiences say but not how affective, topical, and psycholinguistic signals are jointly organised. We introduce a symbolic-semantic framework for analysing audience discourse around virtual and human influencers. The symbolic layer uses Formal Concept Analysis and association rule mining to extract closed co-occurrence structures from sentiment labels, topic tags, and Big Five psycholinguistic cues. The semantic layer renders these formal concepts as natural-language descriptions, embeds them with MiniLM, and compares their geometry across VI and HI audiences. Applied to 69,498 YouTube comments from three matched VI-HI influencer pairs, our analysis shows that HI discourse is organised around a compact, stability-centred pattern in which low neuroticism anchors positive sentiment, whereas VI discourse supports multiple discourse regimes. VI concepts are also more semantically dispersed than HI concepts, while both groups show strong symbolic-semantic alignment between closed-set structure and embedding geometry. Finally, VI discourse contains a distinct artificial-identity region and a higher concentration of negative sentiment in sensitive topics such as mental health, body image, and artificial identity. These findings suggest that virtuality reshapes not only the sentiment of audience reactions, but also the symbolic and semantic organisation of online social discourse.
♻ ☆ Learning Compact Boolean Networks
Floating-point neural networks dominate modern machine learning but incur substantial inference costs, motivating emerging interest in Boolean networks for resource-constrained deployments. Since Boolean networks use only Boolean operations, they can achieve nanosecond-scale inference latency. However, learning Boolean networks that are both compact and accurate remains challenging because of their discrete, combinatorial structure. In this work we address this challenge via three novel, complementary contributions: (i) a new parameter-free strategy for learning effective connections, (ii) a novel compact convolutional Boolean architecture that exploits spatial locality while requiring fewer Boolean operations than existing convolutional kernels, and (iii) an adaptive discretization procedure that reduces the accuracy drop incurred when converting a continuously relaxed network into a discrete Boolean network. Across standard vision benchmarks, our method improves the Pareto frontier over prior state-of-the-art methods, achieving higher accuracy with up to $47\times$ fewer Boolean operations. This advantage also extends to other modalities. Further, on an FPGA, our model on MNIST achieves 99.38\% accuracy with 6.48 ns latency, surpassing the prior state-of-the-art in both accuracy and runtime, while generating a $7\times$ smaller circuit. Code and models are available at https://github.com/eth-sri/CompactLogic.
♻ ☆ Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents ACL 2026
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.
comment: Accepted by ACL 2026 Main Conference
♻ ☆ Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache
Suffix jailbreak attacks serve as a systematic method for red-teaming Large Language Models (LLMs) but suffer from prohibitive computational costs, as a large number of candidate suffixes need to be evaluated before identifying a jailbreak suffix. This paper presents Prefix-Shared KV Cache (PSKV), a plug-and-play inference optimization technique tailored for jailbreak suffix generation. Our method is motivated by a key observation that when performing suffix jailbreaking, while a large number of candidate prompts need to be evaluated, they share the same targeted harmful instruction as the prefix. Therefore, instead of performing redundant inference on the duplicated prefix, PSKV maintains a single KV cache for this prefix and shares it with every candidate prompt, enabling the parallel inference of diverse suffixes with minimal memory overhead. This design enables more aggressive batching strategies that would otherwise be limited by memory constraints. Extensive experiments on six widely used suffix attacks across five widely deployed LLMs demonstrate that PSKV reduces inference time by 40\% and peak memory usage by 50\%, while maintaining the original Attack Success Rate (ASR). The code has been submitted and will be released publicly.
comment: 27 pages, 7 figures, preprint
♻ ☆ BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking ICML 2026
Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95\% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.
comment: Accepted to ICML 2026
♻ ☆ IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $κ_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0--3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
♻ ☆ FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.
♻ ☆ Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.
♻ ☆ Multilevel Fair Allocation with Matroid-Rank Preferences
We introduce the concept of multilevel fair allocation of resources with tree-structured hierarchical relations among agents. While at each level it is possible to consider the problem locally as an allocation of an agent to its children, the multilevel allocation can be seen as a trace capturing the fact that the process is iterated until the leaves of the tree. In principle, each intermediary node may have its own local allocation mechanism. The main challenge is then to design algorithms which can retain good fairness and efficiency properties. In this paper we propose two original algorithms under the assumption that leaves of the tree have matroid-rank utility functions and the utility of any internal node is the sum of the utilities of its children. The first one is a generic polynomial-time sequential algorithm that comes with theoretical guarantees in terms of efficiency and fairness. It operates in a top-down fashion -- as commonly observed in real-world applications -- and is compatible with various local algorithms. The second one extends the recently proposed General Yankee Swap to the multilevel setting. This extension comes with efficiency guarantees only, but we show that it preserves excellent fairness properties in practice.
♻ ☆ Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
comment: Accepted by NLP4DH 2026
♻ ☆ CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making
Generative models have emerged as a promising paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step acceleration methods either distill a joint teacher into independent students or apply averaged velocity fields independently to each agent. Unfortunately, these few-step approaches hurt inter-agent coordination. We show the efficiency-coordination trade-off is not necessary: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian / value-based, transformer, diffusion, and other prior flow baselines on episodic return. Three independent coordination probes confirm that CoFlow's gains flow through inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution.Project page: https://guowei-zou.github.io/coflow/
comment: 34 pages, 15 figures, 10 tables. Project page: https://guowei-zou.github.io/coflow/
♻ ☆ A Formal Comparison Between Chain of Thought and Latent Thought ICML 2026
Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.
comment: Camera-ready version for ICML 2026
♻ ☆ MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($λ$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(γλ)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $γ$ and $λ$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
comment: 22 pages, 11 figures (containing 43 individual image panels total)
♻ ☆ MieDB-100k: A Comprehensive Dataset for Medical Image Editing
The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding and inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality and diverse dataset for text-guided medical image editing. It categorizes editing tasks into perspectives of Perception, Modification and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthetic methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that model trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.
♻ ☆ Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation: An Interventional and Predictive Framework for Grokkin
Grokking - the delayed transition from memorisation to generalisation in neural networks - remains poorly understood. We study this phenomenon through the geometry of learned representations and identify a consistent empirical signature preceding generalisation: collapse of the spectral entropy of the representation covariance matrix. Across modular arithmetic tasks and multiple random seeds, spectral entropy decreases gradually during training and crosses a stable task-specific threshold before test accuracy rises. A representation-mixing intervention that delays this collapse also delays grokking, including under norm-matched controls, indicating that the effect is not explained by parameter norm alone. We further show that the entropy gap predicts the remaining time until grokking with useful out-of-sample accuracy. To probe the structure underlying this transition, we introduce a Fourier-alignment observable for cyclic-group tasks. Entropy collapse is strongly coupled to the emergence of Fourier-aligned representations, suggesting that spectral entropy tracks concentration of the representation into task-structured directions rather than generic compression alone. The same qualitative dynamics appear in non-abelian group composition tasks, while MLP controls show that entropy collapse by itself is insufficient for grokking in the absence of appropriate inductive bias. Taken together, the results support a view of grokking as a representational phase transition with an observable geometric signature. We discuss the scope and limitations of this interpretation, connections to recent feature-learning and spectral-dynamics work, and directions for testing whether similar transitions appear in larger-scale learning systems.
comment: 25 pages, 15 figures, 6 tables
♻ ☆ Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.
♻ ☆ CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
comment: 22 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026. Updated to camera-ready version with appendix and text/formatting revisions
♻ ☆ Hallucination Detection in LLMs with Topological Divergence on Attention Graphs ACL 2026
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.
comment: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
♻ ☆ UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration
Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) \textbf{Degradation-specific dependency}, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) \textbf{Over-smoothing bias}, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.
comment: 6 figures, 11 tables
♻ ☆ One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization IEEE
Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users' varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as "artificial equalizers," contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.
comment: 13 pages, 15 figures, 2 tables, IEEE JSTSP submission
♻ ☆ DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes IEEE
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
The project of aligning machine behavior with human values raises a basic problem: whose moral expectations should guide AI decision-making? Much alignment research assumes that the appropriate benchmark is how humans themselves would act in a given situation. Studies of agent-type value forks challenge this assumption by showing that people do not always judge humans and AI systems identically.This paper extends that challenge by examining two further possibilities: first, that evaluations of AI behavior change when its human origins are made visible; and second, that people judge the humans who program AI systems differently from either the machines or the human actors they are compared against. An experiment with 1,002 U.S. adults measured moral judgments in a runaway mine train scenario, varying the subject of evaluation across four conditions: a repairman, a repair robot, a repair robot programmed by company engineers, and company engineers programming a repair robot. We find no significant difference in evaluations of the repairman and the robot. However, judgments shifted substantially when the robot's actions were described as the product of human design. Participants exhibited markedly more deontological, rule-based reasoning when evaluating either the programmed robot or the engineers who programmed it, suggesting that rendering human agency visible activates heightened moral constraints. These findings indicate that people may evaluate humans, AI systems acting in the same situation, and the humans who design them in meaningfully different ways. The fact that these evaluations do not necessarily converge gives rise to the alignment target problem: which normative target should guide the development of artificial moral agents in high-stakes domains, and whether these plural judgments can be reconciled within a coherent account of value alignment.
comment: Accepted at ACM FAccT 2026
♻ ☆ Reconsidering the energy efficiency of spiking neural networks
Spiking Neural Networks (SNNs) promise higher energy efficiency over conventional Quantized Artificial Neural Networks (QNNs) due to their event-driven, spike-based computation. However, prevailing energy evaluations often oversimplify, focusing on computational aspects while neglecting critical overheads like comprehensive data movement and memory access. Such simplifications can lead to misleading conclusions regarding the true energy benefits of SNNs. This paper presents a rigorous re-evaluation. We establish a fair baseline by mapping rate-encoded SNNs with $T$ timesteps to functionally equivalent QNNs with $\lceil \log_2(T+1) \rceil$ bits. This ensures both models have comparable representational capacities, as well has similar hardware requirement, enabling meaningful energy comparisons. We introduce a detailed analytical energy model encompassing core computation and data movement. Using this model, we systematically explore a wide parameter space, including intrinsic network characteristics ($T$, spike rate $s_r$, QNN sparsity $γ$, model size $N$, weight bit-level) and hardware characteristics (memory system and network-on-chip). Our analysis identifies specific operational regimes where SNNs genuinely offer superior energy efficiency. For example, under typical neuromorphic hardware conditions, SNNs with moderate time windows ($T \in [5,10]$) require an average spike rate ($s_r$) below 6.4\% to outperform equivalent QNNs. Furthermore, to illustrate the real-world implications of our findings, we analyze the operational lifetime of a typical smartwatch, showing that an optimized SNN can nearly double its battery life compared to a QNN. These insights guide the design of turely energy-efficient neural network solutions.
♻ ☆ RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs ICML 2026
Primal heuristics play a crucial role in quickly finding feasible solutions for NP-hard integer linear programming (ILP). Although $\textit{end-to-end learning}$-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILP involving non-binary integers. Empirically, RL-SPH rapidly obtains high-quality feasible solutions with a 100% feasibility rate, achieving on average a 28.6$\times$ lower primal gap and a 2.6$\times$ lower primal integral compared to existing start primal heuristics.
comment: Accepted at ICML 2026. 30 pages, 12 figures, 22 tables
♻ ☆ Inference-Only Prompt Projection for Safe Text-to-Image Generation with TV Guarantees
Text-to-Image (T2I) diffusion models enable high quality open ended synthesis, but practical use requires suppressing unsafe generations while preserving behavior on benign prompts. We study this tension relative to the frozen generator, using its prompt conditioned distribution as the preservation reference. Since T2I safety is commonly evaluated by bounded risk scores on generated images, total variation (TV) bounds how much expected risk can change from this reference. We call this fixed reference constraint the Safety-Prompt Alignment Tradeoff (SPAT): reducing expected unsafety requires prompt conditioned distributional deviation. To make this deviation selective and adjustable, we define the tau safe set as prompts whose reference risk is at most tau, and cast intervention as projection toward nearby prompts in this set. We propose Selective Prompt prOjecTion (SPOT), an inference time framework that approximates this projection without retraining the generator or learning a category specific rewriter. SPOT uses an LLM to rank candidate rewrites and a safeguard VLM to accept generated images under the same tau. Across four datasets and three diffusion backbones, SPOT achieves relative inappropriate (IP) score reductions from 14.2% to 44.4% over strong safety alignment baselines while keeping benign prompt behavior close to the fixed reference.
♻ ☆ Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
♻ ☆ Modality-Inconsistent Continual Learning of Multimodal Large Language Models
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
comment: Accepted at Transactions on Machine Learning Research (TMLR), 2026
♻ ☆ StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1{,}500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. \textbf{(i)} Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. \textbf{(ii)} Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. \textbf{(iii)} Human and LLM harmfulness judgments are broadly aligned (Spearman $ρ=0.62$), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.
comment: Preprint
♻ ☆ Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving remarkable model capability similar to proprietary ones. But their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse serving systems. We verify these insights on both future wafer-scale GPU architectures and existing GPU systems. On wafer-scale GPUs, lightweight architectural modifications guided by our insights yield a 6.6$\times$ average speedup across four 200B--1000B models. On existing GPU systems, our insights drive the design of a prefill-aware expert placement algorithm that achieves up to 1.25$\times$ speedup on MoE computation. Our work presents the first comprehensive data-centric analysis of large-scale MoE models together with a concrete design study applying the learned lessons. Our profiling traces are publicly available at \href{https://huggingface.co/datasets/core12345/MoE_expert_selection_trace}{\textcolor{blue}{https://huggingface.co/datasets/core12345/MoE\_expert\_selection\_trace}}.
♻ ☆ Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs ICML 2026
Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position advocates for second-order prediction -- incorporating probabilities explicitly as part of the output -- as a theoretically sound method, in contrast to using token logprobs. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.
comment: Accepted to ICML 2026 (position track)
♻ ☆ Is Data Shapley Not Better than Random in Data Selection? Ask NASH ICML-26
Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
♻ ☆ SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the "Speech Analysis for Neurodegenerative Diseases" (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.
♻ ☆ Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
comment: Due to issues found with the annotations in Section 4.3, we have decided to withdraw this preprint
♻ ☆ Project-Level C-to-Rust Translation via Pointer Knowledge Graphs
Translating C code into safe Rust is an effective way to ensure memory safety. Compared to rule-based approaches, which often produce largely unsafe Rust code, LLM-based methods generate more idiomatic and safer Rust by leveraging extensive training on human-written code. Despite their promise, existing LLM-based approaches still struggle with project-level C-to-Rust translation. They typically partition a C project into smaller units (e.g., functions) based on call graphs and translate them in a bottom-up manner to resolve dependencies. However, this unit-by-unit paradigm often fails to handle pointers due to the lack of a global view of their usage. To address this limitation, we propose a novel C-to-Rust Pointer Knowledge Graph (KG) that augments code dependency graphs with two types of pointer semantics: (i) pointer usage information, which captures global behaviors such as points-to flows and lifts low-level struct interactions to higher-level abstractions; and (ii) Rust-oriented annotations, which encode ownership, mutability, nullability, and lifetime. Building on this KG, we further propose PtrTrans, a project-level C-to-Rust translation approach. In PtrTrans, the KG provides LLMs with comprehensive global pointer semantics, guiding them to generate safe and idiomatic Rust code. Experimental results show that PtrTrans reduces unsafe usages in translated Rust by 99.9% compared to both rule-based and conventional LLM-based methods, while achieving 29.3% higher functional correctness than fuzzing-enhanced LLM approaches.
comment: Accepted by FSE'26
♻ ☆ Probing RLVR training instability through the lens of objective-level hacking ICML 2026
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.
comment: Accepted by ICML 2026
♻ ☆ Localising Dropout Variance in Twin Networks
Accurate individual treatment-effect estimation demands not only reliable point predictions but also uncertainty measures that help practitioners \emph{locate} the source of model failure. We introduce a layer-wise variance decomposition for deep twin-network models: by toggling Monte Carlo Dropout independently in the shared encoder and the outcome heads, we split total predictive variance into an \emph{encoder component} ($σ_{\mathrm{enc}}^2$) and a \emph{head component} ($σ_{\mathrm{head}}^2$), with $σ_{\mathrm{enc}}^2 + σ_{\mathrm{head}}^2 \approx σ_{\mathrm{tot}}^2$ by the law of total variance. Across three synthetic covariate-shift regimes, the encoder component dominates under distributional shift ($ρ_{\mathrm{enc}}=0.53$) while the head component becomes informative only once encoder uncertainty is controlled. On a real-world twins cohort with induced multivariate shift, only $σ_{\mathrm{enc}}^2$ spikes on out-of-distribution samples and becomes the primary error predictor ($ρ_{\mathrm{enc}}\!\approx\!0.89$), while $σ_{\mathrm{head}}^2$ remains flat. The decomposition adds negligible cost over standard MC Dropout and provides a practical diagnostic for deciding whether to collect more diverse covariates or more outcome data.
comment: 14 pages, 5 figures, 3 tables
♻ ☆ SDG-MoE: Signed Debate Graph Mixture-of-Experts
Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph $A^+$ and a critique graph $A^-$, capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.
♻ ☆ Stochastic Parrots or Singing in Harmony? Testing Five Leading LLMs for their Ability to Replicate a Human Survey with Synthetic Data
How well can AI-derived synthetic research data replicate the responses of human participants? An emerging literature has begun to engage with this question, which carries deep implications for organizational research practice. This article presents a comparison between a human-respondent survey of 420 Silicon Valley coders and developers and synthetic survey data designed to simulate real survey takers generated by five leading Generative AI Large Language Models: ChatGPT Thinking 5 Pro, Claude Sonnet 4.5 Pro plus Claude CoWork 1.123, Gemini Advanced 2.5 Pro, Incredible 1.0, and DeepSeek 3.2. Our findings reveal that while AI agents produced technically plausible results that lean more towards replicability and harmonization than assumed, none were able to capture the counterintuitive insights that made the human survey valuable. Moreover, deviations grouped together for all models, leaving the real data as the outlier. Our key finding is that while leading LLMs are increasingly being used to scale, replicate and replace human survey responses in research, these advances only show an increased capacity to parrot conventional wisdom in harmony with each other rather than revealing novel findings. If synthetic respondents are used in future research, we need more replicable validation protocols and reporting standards for when and where synthetic survey data can be used responsibly, a gap that this paper fills. Our results suggest that synthetic survey responses cannot meaningfully model real human social beliefs within organizations, particularly in contexts lacking previously documented evidence. We conclude that synthetic survey-based research should be cast not as a substitute for rigorous survey methods, but as an increasingly reliable pre- or post-fieldwork instrument for identifying societal assumptions, conventional wisdoms, and other expectations about research populations.
♻ ☆ Does Your Reasoning Model Implicitly Know When to Stop Thinking?
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.
♻ ☆ A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities
Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.
♻ ☆ Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines EACL 2026
We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
comment: EACL 2026 Industry
♻ ☆ A Counterfactual Cause in Situation Calculus
Recently, Batusov and Soutchanski proposed a notion of actual achievement cause in the situation calculus, amongst others, they can determine the cause of quantified effects in a given action history. While intuitively appealing, this notion of cause is not defined in a counterfactual perspective. In this paper, we propose a notion of cause based on counterfactual analysis. In the context of action history, we show that our notion of cause generalizes naturally to a notion of achievement cause. We analyze the relationship between our notion of the achievement cause and the achievement cause by Batusov and Soutchanski. Finally, we relate our account of cause to Halpern and Pearl's account of actual causality. Particularly, we note some nuances in applying a counterfactual viewpoint to disjunctive goals, a common thorn in definitions of actual causes.
comment: This version changes the working title of the extended report and fixes some errors
♻ ☆ When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.
♻ ☆ A Lightweight Transformer for Pain Recognition from Brain Activity
Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.
♻ ☆ Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion AAAI 2026
Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
comment: Published at AAAI 2026
♻ ☆ MuViS: Multimodal Virtual Sensing Benchmark
Virtual sensing aims to infer hard-to-measure quantities from accessible measurements and is central to perception and control in physical systems. Despite rapid progress from first-principle and hybrid models to modern data-driven methods research remains siloed, leaving no established default approach that transfers across processes, modalities, and sensing configurations. We introduce MuViS, a domain-agnostic benchmarking suite for multimodal virtual sensing that consolidates diverse datasets into a unified interface for standardized preprocessing and evaluation. Using this framework, we benchmark established approaches spanning gradient-boosted decision trees and deep neural network (NN) architectures, and show that none of these provides a universal advantage, underscoring the need for generalizable virtual sensing architectures. MuViS is released as an open-source, extensible platform for reproducible comparison and future integration of new datasets and model classes.
comment: Accepted at European Signal Processing Conference (EUSIPCO) 2026
♻ ☆ Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech
Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.
♻ ☆ Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data
COVID-19 has affected more than 223 countries worldwide and in the Post-COVID Era, there is a pressing need for non-invasive, low-cost, and highly scalable solutions to detect COVID-19. We develop a deep learning model to identify COVID-19 from voice recording data. The novelty of this work is in the development of deep learning models for COVID-19 identification from only voice recordings. We use the Cambridge COVID-19 Sound database which contains 893 speech samples, crowd-sourced from 4352 participants via a COVID-19 Sounds app. Voice features including Mel-spectrograms and Mel-frequency cepstral coefficients (MFCC) and CNN Encoder features are extracted. Based on the voice data, we develop deep learning classification models to detect COVID-19 cases. These models include Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) and Hidden-Unit BERT (HuBERT). We compare their predictive power to baseline machine learning models. HuBERT achieves the highest accuracy of 86\% and the highest AUC of 0.93. The results achieved with the proposed models suggest promising results in COVID-19 diagnosis from voice recordings when compared to the results obtained from the state-of-the-art.
comment: arXiv admin note: text overlap with arXiv:2209.03727
♻ ☆ Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression ICML 2026
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
comment: ICML 2026
♻ ☆ Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
Despite the empirical successes of Large Language Models (LLMs), the prevailing paradigm is heuristic and experiment-driven, tethered to massive compute and data, while a first-principles theory remains absent. This treatise develops a Semantic Information Theory at the confluence of statistical physics, signal processing, and classical information theory, organized around a single paradigm shift: replacing the classical BIT - a microscopic substrate devoid of semantic content - with the macroscopic TOKEN as the atomic carrier of meaning and reasoning. Within this framework we recast attention and the Transformer as energy-based models, and interpret semantic embedding as vectorization on the semantic manifold. Modeling the LLM as a stateful channel with feedback, we adopt Massey's directed information as the native causal measure of autoregressive generation, from which we derive a *directed rate-distortion function for pre-training, a directed rate-reward function for RL-based post-training, and a sub-martingale account of inference-time semantic information flow. This machinery makes precise the identification of next-token prediction with Granger causal inference, and sharpens the limits of LLM reasoning against Pearl's Ladder of Causation - affirming that *whereas the BIT defined the Information Epoch, the TOKEN will define the AI Epoch.
♻ ☆ Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments CVPR 2026
High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead. To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation. Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.
comment: Accepted to CVPR 2026
♻ ☆ HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple initial grasp from complex interaction execution. First, Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly-variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks with broad type setting, alongside a 36.7% gain across four real-world tasks with different interaction types.
♻ ☆ VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection
Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense--sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage
♻ ☆ Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
comment: 17 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA
♻ ☆ Impact of AI Search Summaries on Website Traffic: Evidence from Google AI Overviews and Wikipedia
Search engines increasingly display LLM-generated answers shown above organic links, shifting search from link lists to answer-first summaries. Publishers contend these summaries substitute for source pages and cannibalize traffic, while platforms argue they are complementary by directing users through included links. We estimate the causal impact of Google's AI Overview (AIO) on Wikipedia traffic by leveraging the feature's staggered geographic rollout and Wikipedia's multilingual structure. Using a difference-in-differences design, we compare English Wikipedia articles exposed to AIO to the same underlying articles in language editions (Hindi, Indonesian, Japanese, and Portuguese) that were not exposed to AIO during the observation period. Across 161,382 matched article-language pairs, AIO exposure reduces daily traffic to English articles by approximately 15%. Effects are heterogeneous: relative declines are largest for Culture articles and substantially smaller for STEM, consistent with stronger substitution when short synthesized answers satisfy informational intent. These findings provide early causal evidence that generative-answer features in search engines can materially reallocate attention away from informational publishers, with implications for content monetization, search platform design, and policy.
♻ ☆ SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
Skills provide an effective mechanism for improving LLM agents on complex tasks, yet in existing agent frameworks, their creation, refinement, and selection are typically governed by external teachers, hand-designed rules, or auxiliary modules. As a result, skills remain external resources to be invoked, rather than capabilities that agents can develop, adapt, and internalize through experience. To endow LLM agents with autonomous skill mastery, we propose SkillMaster, a training framework that teaches agents to create new skills, refine existing skills, and select accumulated skills during task solving. This capability is achieved through three key designs. First, we train agents through trajectory-informed skill review, teaching agents to propose, update, or retain skills based on evidence from completed episodes. Second, each candidate skill edit is designed to be evaluated by its counterfactual utility on related probe tasks, providing a direct learning signal for training skill-editing decisions. Third, we introduce DualAdv-GRPO, which separately estimates advantages for task-solving actions and skill-editing decisions, stabilizing joint training across task solving and skill management. Experiments on ALFWorld and WebShop show that SkillMaster improves the overall success rate over state-of-the-art baselines by 8.8% and 9.3%, respectively, achieving the best performance among all compared methods. Further analysis reveals a marked shift in agent capability: agents trained with SkillMaster can identify skill failures, refine procedural knowledge from trajectory evidence, and transfer improvements to future tasks with limited skill-bank edits. Overall, SkillMaster moves LLM agents beyond mere skill use toward self-improving agents capable of developing, adapting, and applying their own skill repertoires.
♻ ☆ Value-Decomposed Reinforcement Learning Framework for Taxiway Routing with Hierarchical Conflict-Aware Observations
Taxiway routing and on-surface conflict avoidance are coupled safety-critical decision problems in airport surface operations. Existing planning and optimization methods are often limited by online computational cost, while reinforcement learning methods may struggle to represent downstream traffic conflicts and balance multiple objectives. This paper presents Conflict-aware Taxiway Routing (CaTR), a reinforcement learning framework for real-time multi-aircraft taxiway routing. CaTR constructs a grid-based airport surface environment with action masking, introduces a hierarchical foresight traffic representation to encode current and downstream conflict-related traffic conditions, and adopts a value-decomposed reinforcement learning strategy to prioritize sparse but safety-critical objectives. Experiments are conducted on a realistic environment based on Changsha Huanghua International Airport under multiple traffic density levels. Results show that CaTR achieves better safety--efficiency trade-offs than representative planning, optimization, and reinforcement learning baselines while maintaining practical runtime.
♻ ☆ Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2{=}0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
♻ ☆ From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity ICML 2026
Traditional hallucination detection fails on "Stubborn Hallucinations" - errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
comment: Accepted to ICML 2026. Camera-ready version
♻ ☆ The Confusion is Real: GRAPHIC -- A Network Science Approach to Confusion Matrices in Deep Learning
Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture-agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at https://github.com/Johanna-S-Froehlich/GRAPHIC.
comment: Transactions on Machine Learning Research, 2026
♻ ☆ SLASH the Sink: Sharpening Structural Attention Inside LLMs
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (SLASH), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate thst SLASH delivers significant and consistent performance gains across diverse LLMs.
♻ ☆ Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches are not adequately capturing human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.
♻ ☆ MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making
Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.
♻ ☆ Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach ICML 2026
Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect designing specialized classification heads. They mainly rely on generic aggregating methods like concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", where multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which can generate a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs into a graph-level classification task over the generated prompt graph. This lightweight module seamlessly integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI achieves superior performance enhancements over advanced competitors.
comment: Accepted at ICML 2026
♻ ☆ Molecular Design beyond Training Data with Novel Extended Objective Functionals of Generative AI Models Driven by Quantum Annealing Computer
Deep generative modeling to stochastically design small molecules is an emerging technology for accelerating drug discovery and development. However, one major issue in molecular generative models is their lower frequency of drug-like compounds. To resolve this problem, we developed a novel framework for optimization of deep generative models integrated with a D-Wave quantum annealing computer, where our Neural Hash Function (NHF) presented herein is used both as the regularization and binarization schemes simultaneously, of which the latter is for transformation between continuous and discrete signals of the classical and quantum neural networks, respectively, in the error evaluation (i.e., objective) function. The compounds generated via the quantum-annealing generative models exhibited higher quality in both validity and drug-likeness than those generated via the fully-classical models, and was further indicated to exceed even the training data in terms of drug-likeness features, without any restraints and conditions to deliberately induce such an optimization. These results indicated an advantage of quantum annealing to aim at a stochastic generator integrated with our novel neural network architectures, for the extended performance of feature space sampling and extraction of characteristic features in drug design.
comment: 28 pages, 4 figures
♻ ☆ A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse
Autoregressive language models (ARMs) suffer from the reversal curse: after learning ''$A$ is $B$,'' they often fail on the reverse query ''$B$ is $A$.'' Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing ''$[\mathbf{M}]$ is $B$'' during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt ''$B$ is $[\mathbf{M}]$.'' We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through queries and keys without changing the value-side evidence being retrieved. In a one-layer MDM, we prove that forward masked training strengthens evidence that is reusable in reverse queries, induces correlated forward--reverse attention routes, and yields a positively aligned shared-storage gradient component that decreases the reverse loss to first order. Controlled one-layer experiments and large-scale LLaDA/Dream experiments verify these signatures and show that they translate into improved reverse prediction.
♻ ☆ GeoLaux: A Benchmark for Evaluating Models' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Geometry problem solving (GPS) poses significant challenges for current models in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading models. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models' understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark models geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux
comment: 26 pages, 24 figures
♻ ☆ FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning
Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
comment: The authors want to withdraw this manuscript for further verification and revision. We may release a substantially revised version in the future
Computation and Language 180
☆ LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.
comment: Work in Progress
☆ Task-Adaptive Embedding Refinement via Test-time LLM Guidance
We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at https://github.com/IBM/task-aware-embedding-refinement.
☆ MEME: Multi-entity & Evolving Memory Evaluation
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
☆ Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
☆ KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
comment: 12 pages, 3 figures, 6 tables
☆ Solve the Loop: Attractor Models for Language and Reasoning
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
☆ Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
comment: Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
☆ TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance; while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also ``radioactive'': its watermark signal transfers through model distillation, enabling detection of unauthorized use.
☆ The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events
Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.
☆ ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models
Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question--answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
comment: 18 pages, 2 figures
☆ A Causal Language Modeling Detour Improves Encoder Continued Pretraining
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
☆ Geometric Factual Recall in Transformers
How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, \emph{geometric} form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode \emph{linear superpositions} of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, and not as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as ``Who is the mother of the wife of $x$?'' -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorized any particular set of facts.
comment: Preprint
☆ Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.
comment: Accepted to Educational Data Mining (EDM) 2026 (Poster/Demo Track)
☆ ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
☆ Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
☆ Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.
☆ Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring ACL 2026
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets-TriviaQA, NQ, MuSiQue, and QASC-demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
comment: Accepted at ACL 2026
☆ A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
Background: Many different approaches to controlled text generation (CTG) have been proposed over recent years, but it is difficult to get a clear picture of which approach performs best, because different datasets and evaluation methods are used in each case to assess the control achieved. Objectives: Our aim in the work reported in this paper is to develop an approach to evaluation that enables us to comparatively evaluate different CTG systems in a manner that is both informative and fair to the individual systems. Methods: We use a level-playing-field (LPF) approach to comparative evaluation where we (i) generate and process all system outputs in a standardised way, and (ii) apply a shared set of evaluation methods and datasets, selected based on those currently in use, in order to ensure fair evaluation. Results: When re-evaluated in this way, performance results for a representative set of current CTG systems differ substantially from originally reported results, in most cases for the worse. This highlights the importance of a shared standardised way of assessing controlled generation. Conclusions: The discrepancies revealed by LPF evaluation demonstrate the urgent need for standardised, reproducible evaluation practices in CTG. Our results suggest that without such practices, published performance claims may substantially misrepresent true system capabilities.
☆ Scalable Token-Level Hallucination Detection in Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
Pretraining Exposure Explains Popularity Judgments in Large Language Models SIGIR 2026
Large language models (LLMs) exhibit systematic preferences for well-known entities, a phenomenon often attributed to popularity bias. However, the extent to which these preferences reflect real-world popularity versus statistical exposure during pretraining remains unclear, largely due to the inaccessibility of most training corpora. We provide the first direct, large-scale analysis of popularity bias grounded in fully observable pretraining data. Leveraging the open OLMo models and their complete pretraining corpus, Dolma, we compute precise entity-level exposure statistics across 7.4 trillion tokens. We analyze 2,000 entities spanning five types (Person, Location, Organization, Art, Product) and compare pretraining exposure against Wikipedia pageviews and two elicited LLM popularity signals: direct scalar estimation and pairwise comparison. Our results show that pretraining exposure strongly correlates with Wikipedia popularity, validating exposure as a meaningful proxy for real-world salience during the training period. More importantly, we find that LLM popularity judgments align more closely with exposure than with Wikipedia, especially when elicited via pairwise comparisons. This alignment is strongest for larger models and persists in the long tail, where Wikipedia popularity becomes unreliable. Overall, our findings demonstrate that popularity priors in LLMs are primarily shaped by pretraining statistics rather than external popularity signals, offering concrete evidence that data exposure plays a central role in driving popularity bias.
comment: Accepted at SIGIR 2026
☆ Context Convergence Improves Answering Inferential Questions SIGIR 2026
While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions-where answers must be derived rather than directly retrieved-remains still underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
comment: Accepted at SIGIR 2026
☆ MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
☆ Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation
Parameter-efficient fine-tuning (PEFT) techniques offer task-specific fine-tuning at a fraction of the cost of full fine-tuning, but require separate fine-tuning for every new task (combination). In this paper, we explore three ways of generalising beyond single-task training/inference: (i) training on combinations of multiple, related datasets; (ii) at inference, composing the weight matrices of separately trained PEFT modules; and (iii) at inference, composing the outputs of separately trained PEFT modules. We test these approaches on three different LLMs, QLoRA as the PEFT technique, and three sets of controlled text generation datasets for sentiment control, topic control, and multi-attribute control. We find that summing PEFT module outputs is a particularly strong composition method, which consistently either outperforms or matches the performance of alternative approaches. This is the case even when comparing against single-task specialised modules on the single-task test set, where three-module output composition achieves an average 2% point performance increase across all models for sentiment control.
☆ A categorical error sensitivity index (ISEC): A preventive ordinal decision-support measure for irrecoverable errors in manual data entry systems
Data entry systems remain structurally vulnerable to categorical misclassifications, particularly in small and medium sized enterprises (SMEs). When nominal categories exhibit semantic or morphological proximity, human machine interaction may produce errors that are irrecoverable ex post. In the absence of automated input controls, manual data entry frequently generates irrecoverable categorical distortions that propagate into Key Performance Indicators (KPIs), thereby misleading managerial decision making. State of the art normalization tools typically evaluate semantic and morphological dimensions in isolation and rely heavily on standard dictionaries, rendering them ineffective for SME master data rich in custom SKUs, abbreviations, and domain-specific technical jargon. This paper introduces the Categorical Error Sensitivity Index (ISEC), an ordinal composite score designed to rank category pairs according to their structural susceptibility to confusion. ISEC integrates semantic distance (via word embeddings), custom weighted morphological transformation costs (through an adapted Damerau Levenshtein algorithm), and empirical frequency into a unified, mathematically robust preventive framework. By leveraging vector database architectures, ISEC reduces computational complexity, achieving approximately a 195x performance improvement over brute-force methods. Validated across three heterogeneous datasets: governmental judicial records, retail inventory, and a synthetic ISO coded metalworking catalog, ISEC provides a scalable and proactive data governance instrument that enables SMEs to detect latent structural risk embedded within their categorical data assets.
comment: 15 pages, 4 figures
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark in multi-hop reasoning for large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa and benchmark https://www.codabench.org/competitions/7609/
☆ GKnow: Measuring the Entanglement of Gender Bias and Factual Gender ACL 2026
Recent works have analyzed the impact of individual components of neural networks on gendered predictions, often with a focus on mitigating gender bias. However, mechanistic interpretations of gender tend to (i) focus on a very specific gender-related task, such as gendered pronoun prediction, or (ii) fail to distinguish between the production of factually gendered outputs (the correct assumption of gender given a word that carries gender as a semantic property) and gender biased outputs (based on a stereotype). To address these issues, we curate \gknow, a benchmark to assess gender knowledge and gender bias in language models across different types of gender-related predictions. \gknow allows us to identify and analyze circuits and individual neurons responsible for gendered predictions. We test the impact of neuron ablation on benchmarks for disentangling stereotypical and factual gender (DiFair and the test set of GKnow), as well as StereoSet. Results show that gender bias and factual gender are severely entangled on the level of both circuits and neurons, entailing that ablation is an unreliable debiasing method. Furthermore, we show that benchmarks for evaluating gender bias can hide the decrease in factual gender knowledge that accompanies neuron ablation. We curate GKnow as a contribution to the continuous development of robust gender bias benchmarks.
comment: Accepted to ACL 2026
☆ TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
☆ What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty ACL
What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
comment: Submitted to BEA 2026 at ACL. 18 pages, 13 figures
Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.
☆ PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents
Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.
comment: Preprint
☆ PreScam: A Benchmark for Predicting Scam Progression from Early Conversations
Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.
☆ Mind the Pause: Disfluency-Aware Objective Tuning for Multilingual Speech Correction with LLMs ACL 2026
Automatic Speech Recognition (ASR) transcripts often contain disfluencies, such as fillers, repetitions, and false starts, which reduce readability and hinder downstream applications like chatbots and voice assistants. If left unaddressed, such disfluencies can significantly degrade the reliability of downstream systems. Most existing approaches rely on classical models that focus on identifying disfluent tokens for removal. While this strategy is effective to some extent, it often disrupts grammatical structure and semantic coherence, leading to incomplete or unnatural sentences. Recent literature explored the use of large language models (LLMs); however, these efforts have primarily focused on disfluency detection or data augmentation, rather than performing comprehensive correction. We propose a multilingual correction pipeline where a sequence tagger first marks disfluent tokens, and these signals guide instruction fine-tuning of an LLM to rewrite transcripts into fluent text. To further improve reliability, we add a contrastive learning objective that penalizes the reproduction of disfluent tokens, encouraging the model to preserve grammar and meaning while removing disfluent artifacts. Our experiments across three Indian languages, namely Hindi, Bengali, and Marathi show consistent improvements over strong baselines, including multilingual sequence-to-sequence models. These results highlight that detection-only strategies are insufficient. Combining token-level cues with instruction tuning and contrastive learning provides a practical and scalable solution for multilingual disfluency correction in speech-driven NLP systems. We make the codes publicly available at https://github.com/deepak-kumar-98/Mind-the-Pause.
comment: Accepted to ACL 2026 (Main)
☆ Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as supervised fine-tuning (SFT) and knowledge distillation (KD) suffer from exposure bias and limited recovery from model-generated errors over long horizons; 2) on-policy reinforcement learning methods such as Group Relative Policy Optimization (GRPO) better align training with model-generated states, but are unstable and sample-inefficient due to sparse rewards; 3) on-policy distillation (OPD) provides dense token-level guidance, but does not directly optimize arbitrary reward signals. In this paper, we propose Distilled Group Relative Policy Optimization (dGRPO), a method for long-context reasoning that augments GRPO with dense guidance from a stronger teacher via OPD. We also introduce LongBlocks, a synthetic long-context dataset spanning multi-hop reasoning, contextual grounding, and long-form generation. We conduct extensive experiments and ablations comparing off-policy training, sparse-reward GRPO, and our combined approach, leading to an improved recipe for long-context alignment. Overall, our results show that combining outcome-based policy optimization with knowledge distillation in a single objective provides a more stable and effective path to long-context reasoning, while preserving short-context capabilities.
☆ Mechanistic Interpretability of ASR models using Sparse Autoencoders
Understanding the internal machinations of deep Transformer-based NLP models is more crucial than ever as these models see widespread use in various domains that affect the public at large, such as industry, academia, finance, health. While these models have advanced rapidly, their internal mechanisms remain largely a mystery. Techniques such as Sparse Autoencoders (SAE) have emerged to understand these mechanisms by projecting dense representations into a sparse vector. While existing research has demonstrated the viability of the SAE in interpreting text-based Large Language Models (LLMs), there are no equivalent studies that demonstrate the application of a SAE to audio processing models like Automatic Speech Recognizers (ASRs). In this work, a SAE is applied to Whisper, a Transformer-based ASR, training a high-dimensional sparse latent space on frame-level embeddings extracted from the Whisper encoder. Our work uncovers diverse monosemantic features across linguistic and non-linguistic boundaries, and demonstrates cross-lingual feature steering. This work establishes the viability of a SAE model and demonstrates that Whisper encodes a rich amount of linguistic information.
comment: 10 pages + references and appendix
☆ Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
We study the \textit{parameter placement problem}: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
comment: Preprint. Comments welcome
☆ Mitigating Context-Memory Conflicts in LLMs through Dynamic Cognitive Reconciliation Decoding IEEE
Large language models accumulate extensive parametric knowledge through pre-training. However, knowledge conflicts occur when outdated or incorrect parametric knowledge conflicts with external knowledge in the context. Existing methods address knowledge conflicts through contrastive decoding, but in conflict-free scenarios, static approaches disrupt output distribution. Other dynamic decoding methods attempt to measure the degree of conflict but still struggle with complex real-world situations. In this paper, we propose a two-stage decoding method called Dynamic Cognitive Reconciliation Decoding (DCRD), to predict and mitigate context-memory conflicts. DCRD first analyzes the attention map to assess context fidelity and predict potential conflicts. Based on this prediction, the input is directed to one of two decoding paths: (1) greedy decoding, or (2) context fidelity-based dynamic decoding. This design enables DCRD to handle conflicts efficiently while maintaining high accuracy and decoding efficiency in conflict-free cases. Additionally, to simulate scenarios with frequent knowledge updates, we constructed ConflictKG, a knowledge conflict QA benchmark. Experiments on four LLMs across six QA datasets show that DCRD outperforms all baselines, achieving state-of-the-art performance.
comment: Accepted by IEEE TASLP
☆ Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
☆ Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach
[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment- stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hatπ_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hatπ_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $κ_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
☆ Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection
Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with contemporaneous reports that supply background facts the article omits. We study this omission-relevant setting and observe that current omission-aware approaches typically either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. We propose \emph{Latent Causal Void} (LCV), a retrieval-guided detector that explicitly reconstructs the missing fact for each target sentence and uses it as a textual cross-source relation in graph reasoning. Concretely, LCV retrieves temporally aligned context articles, asks a frozen instruction-tuned large language model to generate a short missing-context description for each sentence--article pair, and feeds the resulting relation text into a heterograph over target sentences and context articles. On the bilingual benchmark of Sheng et al., LCV improves over the strongest omission-aware baseline by $2.56$ and $2.84$ macro-F1 points on the English and Chinese splits, respectively. The results indicate that modeling the missing cross-source fact itself, rather than only attaching retrieved evidence or predicting an omission signal, is a useful representation for omission-aware misinformation detection.
☆ Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models CVPR 2026
Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD-GenX/Uni-AdGen.
comment: 22 pages, 19 figures, CVPR 2026
☆ Metaphor Is Not All Attention Needs
Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
☆ Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward
Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review synthesizes literature on sign language recognition and translation for under-resourced languages, using Azerbaijan Sign Language (AzSL) as a case study. Analysis of global initiatives extracts eight actionable lessons, including community co-design, dialectal diversity capture, and privacy-preserving pose-based representations. Turkic sign languages (Kazakh, Turkish, Azerbaijani) receive special attention, as linguistic proximity enables effective transfer learning. We propose three paradigm shifts: from architecture-centric to data-centric AI, from signer-independent to signer-adaptive systems, and from reference-based to task-specific evaluation metrics. A technical roadmap for AzSL leverages lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment. Progress requires sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.
☆ World Action Models: The Next Frontier in Embodied AI
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
☆ Do Language Models Encode Knowledge of Linguistic Constraint Violations?
Large Language Models (LLMs) achieve strong linguistic performance, yet their internal mechanisms for producing these predictions remain unclear. We investigate the hypothesis that LLMs encode representations of linguistic constraint violations within their parameters, which are selectively activated when processing ungrammatical sentences. To test this, we use sparse autoencoders to decompose polysemantic activations into sparse, monosemantic features and recover candidates for violation-related features. We introduce a sensitivity score for identifying features that are preferentially activated on constraint-violated versus well-formed inputs, enabling unsupervised detection of potential violation-specific features. We further propose a conjunctive falsification framework with three criteria evaluated jointly. Overall, the results are negative in two respects: (1) the falsification criteria are not jointly satisfied across linguistic phenomena, and (2) no features are consistently shared across all categories. While some phenomena show partial evidence of selective causal structure, the overall pattern provides limited support for a unified set of grammatical violation detectors in current LMs.
☆ Is Child-Directed Language Optimized for Word Learning? A Computational Study of Verb Meaning Acquisition
Is child-directed language (CDL) optimized to support language learning, and which aspects of linguistic development does it facilitate? We investigate this question using neural language models trained on CDL versus adult-directed language (ADL). We selectively remove syntactic or lexical co-occurrence information from the model training data, and evaluate the impact of these manipulations on verb meaning acquisition. While disrupting syntax impairs learning across all datasets, models trained on CDL and spoken ADL show significantly higher resilience than those trained on written input. Tracking semantic and syntactic performance over training, we observe a semantic-first trajectory, with verb meanings emerging prior to robust syntactic proficiency, an asynchrony most pronounced in the spoken domain, especially CDL. These results suggest that the advantage for verb learning previously attributed to CDL may instead reflect broader properties of the spoken register, rather than a uniquely CDL-specific optimization.
comment: 8 pages
☆ SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs
Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent must identify not only relevant skills but also how they depend on and build upon each other. Secondly, it also makes library maintenance difficult, since the system lacks structural cues for deciding when skills should be merged, split, or removed. We propose SKILLGRAPH, a framework that represents reusable skills as nodes in a directed graph, with typed edges encoding prerequisite, enhancement, and co-occurrence relations. Given a new task, SKILLGRAPH retrieves not just individual skills, but an ordered skill subgraph that can guide multi-step decision making. The graph is continuously updated from agent trajectories and reinforcement learning feedback, allowing both the skill library and the agent policy to improve together. Experiments on ALFWorld, WebShop, and seven search-augmented QA tasks show that SKILLGRAPH achieves state-of-the-art performance against memory-augmented RL methods, with especially large gains on complex tasks that require composing multiple skills.
comment: Under Review
☆ Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking SemEval2026
We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-fine-tuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
comment: Accepted at SemEval2026, task 8: MTRAGEval
☆ SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement learning using VariantQual as the reward model. Experiments on HellaSwag show that SAGE constructs a large-scale robustness-augmented benchmark with quality comparable to the human-annotated HellaSwag-Pro at substantially lower cost, while the fine-tuned models further generalize to MMLU without benchmark-specific fine-tuning.
comment: Under Review
☆ SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
☆ Learning Agentic Policy from Action Guidance
Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose \textsc{ActGuide-RL}, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, \textsc{ActGuide-RL} substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.
comment: Work in progress
☆ Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30\% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Among the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and contextual subtle cues that text alone often misses
☆ On Predicting the Post-training Potential of Pre-trained LLMs
The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting post-training potential - forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains by fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
comment: Under Review
☆ Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
A target-guided proactive dialogue system aims to steer conversations proactively toward pre-defined targets, such as designated keywords or specific topics. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real-world conversations. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher level and more flexible guidance. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target-guided proactive dialogue systems, thereby narrowing the gap with real world interactions.
comment: 21 pages, 9 Figures, 18 Tables
☆ Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models ICPR 2026
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. https://github.com/aqeeelmirza/clipsum
comment: Accepted to ICPR 2026
☆ StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1\% on CRUXEval and 86.5\% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0\% and 77.7\%) and GPT-4o (85.6\% and 75.1\%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9\%, outperforming baseline CodeReasoner-7B (72.3\%), its 14B counterpart (81.1\%), and GPT-4o (77.3\%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
☆ YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
Preference optimization has become an important post-training paradigm for improving the reasoning abilities of large language models. Existing methods typically rely on externally constructed preference data, using preferred and dispreferred responses as sample-level supervision. However, such external signals rarely make explicit use of capability-related information contained in the model's internal representations. For mathematical reasoning, certain neuron groups may exhibit activation patterns associated with mathematical knowledge, symbolic manipulation, or logical reasoning. Similar to reflexive behavioral signals, these internal activations may provide a coarse indication of whether the model is engaging math-related capabilities.We introduce YFPO, short for Yoked Feature Preference Optimization, a preliminary neuron-guided preference optimization framework for mathematical reasoning. YFPO first uses AttnLRP to identify math-related neurons, and then constructs an auxiliary reward from their activation margin between preferred and dispreferred responses. This design augments external preference learning with internal neuron-level signals. We conduct preliminary experiments on a small-scale language model using GSM8K as the main benchmark. Results suggest that neuron-level signals can interact with preference optimization and occasionally improve reasoning performance, offering a promising direction for more fine-grained and interpretable reasoning-oriented post-training.
comment: 10 pages, 2figures. Work in progress
☆ Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
☆ Concordance Comparison as a Means of Assembling Local Grammars
Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
☆ UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
☆ Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose \textbf{T}rajectory-\textbf{A}ligned optimization via \textbf{Bo}ltzmann \textbf{M}odeling (\textbf{TABOM}), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
comment: Under review
☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
☆ Probabilistic Calibration Is a Trainable Capability in Language Models
Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
☆ More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE-USTC/StableEdit.
☆ ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and its negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6\%, 58.8\%, and 59.8\% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
comment: 11 pages, 5 figures, 4 tables
☆ Choosing features for classifying multiword expressions
Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.
☆ Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
☆ From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens of merely 0.5-1.0% of the LLM's parameters are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
comment: 21 pages, 6 figures, 13 tables
☆ Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control
Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
☆ DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
comment: 19 pages, 7 figures
☆ Training-Inference Consistent Segmented Execution for Long-Context LLMs ICML 2026
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).
comment: Accepted by ICML 2026. 19 pages, 6 figures, 3 tables
☆ Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of ``foresight'': it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the \textbf{Module-Allocation Level}, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the \textbf{Update-Direction Level}, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose \textbf{EffOPD}, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of $3\times$ while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.
☆ AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.
☆ Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66\% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
☆ Slicing and Dicing: Configuring Optimal Mixtures of Experts
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128.Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.
☆ Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter
Large language model (LLM) unlearning aims to remove specific data influences from pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods including sharpness-aware minimization.
☆ Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing benchmarks based on synthetic or curated data, our dataset preserves real exam layouts, diagrams, and Japanese educational text, together with nationwide aggregated student response distributions (N $\approx$ 900{,}000). These features enable direct comparison between human and model performance under a unified evaluation framework. We benchmark recent multimodal LLMs using exact-match accuracy and character-level F1 for open-ended responses, observing substantial variation across subjects and strong sensitivity to visual reasoning demands. Human evaluation and LLM-as-judge analyses further assess the reliability of automatic scoring. Our dataset establishes a reproducible, human-grounded benchmark for multimodal educational reasoning and supports future research on evaluation, feedback generation, and explainable AI in authentic assessment contexts. Our dataset is available at: https://github.com/KyosukeTakami/gakucho-benchmark
☆ Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, {measured by discrepancy between teacher--student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
comment: Pre-print
☆ Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization
Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55\% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.
comment: In submission
☆ OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred for online serving, yet their reasoning performance is bottlenecked by the lack of large-scale, high-quality multimodal CoT supervision. In this paper, we present OmniThoughtVis, a scalable data curation and distillation pipeline for transferring multimodal reasoning capabilities from high-capacity teacher models to smaller, deployment-oriented MLLMs. Starting from a diverse open-source seed pool, our pipeline generates structured CoT traces and performs joint annotation of reasoning difficulty, answer quality, and semantic task tags. To maintain data quality at scale, we combine rule-based filtering, difficulty-aware selection, and tag-based diversity sampling, resulting in a curated corpus of 1.8M samples that supports controllable subset construction for downstream training. We use OmniThoughtVis to distill Qwen3-VL models from 2B to 8B parameters and evaluate them on nine multimodal reasoning benchmarks. The resulting distilled models show consistent gains across model scales, including improvements of up to +16.8 points on MathVerse and +5.6 points on MMMU-Pro for the 4B model. Notably, the distilled 4B model matches or surpasses the undistilled 8B baseline on several tasks, highlighting the practical value of scalable reasoning distillation for deployment-oriented MLLMs.
☆ When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models
Backdoor vulnerabilities widely exist in the fine-tuning of large language models(LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, Prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of LLM, emotion can be decoupled from semantics, forming distinct cluster from the original neutral text. Therefore, we consider the emotional factor as the backdoor trigger to propose a pparasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99\% across both task types and four different models, while maintaining the clean utility of the models.
☆ Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
☆ PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
☆ DiffScore: Text Evaluation Beyond Autoregressive Likelihood
Autoregressive language models are widely used for text evaluation, however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
☆ Efficient LLM-based Advertising via Model Compression and Parallel Verification
Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.
comment: 10 pages, 7 figures, industry paper
☆ Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.
comment: 10 pages, 8 figures
☆ BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
comment: 12 pages, 4figures, 1 table
☆ Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
comment: 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models
☆ Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting ACL 2026
Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stablizes entropy as training progresses.
comment: ACL 2026
☆ Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present \textbf{Checkup2Action}, a multimodal clinical check-up report dataset and benchmark for structured \textit{Action Card} generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, imaging-related evidence, and physician summaries. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.
☆ Controllable User Simulation
Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
☆ AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
☆ A Study on Hidden Layer Distillation for Large Language Model Pre-Training
Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.
☆ Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations IEEE
Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI
comment: Accepted by IEEE ICHI 2026
☆ StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism's outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
☆ Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
comment: 11 pages, 4 figures; code not released yet
☆ Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Large Language Model (LLM) agents have emerged as key intermediaries, orchestrating complex interactions between human users and a wide range of digital services and LLM infrastructures. While prior research has extensively examined the security of LLMs and agents in isolation, the systemic risk of the agent acting as a disruptive hub within the user-agent-service chain remains largely overlooked. In this work, we expose a novel threat paradigm by introducing Mobius Injection, a sophisticated attack that weaponizes autonomous agents into zombie nodes to launch what we define as gent-based and -Oriented DDoS (AbO-DDoS) attacks. By exploiting a structural vulnerability in agentic logic named Semantic Closure, an adversary can induce sustained recursive execution of agent components through a single textual injection. We demonstrate that this attack is exceptionally lightweight, stealthy against both traditional DDoS monitors and contemporary AI safety filters, and highly configurable, allowing for surgical targeting of specific environments or model providers. To evaluate the real-world impact, we conduct extensive experiments across three representative claw-style agents and three mainstream coding agents, integrated with 12 frontier proprietary or open-weight LLMs. Our results demonstrate that Mobius Injection achieves substantial attack success across diverse tasks, driving single-node call amplification up to 51.0x and multi-node p95 latency inflation up to 229.1x. The attack performance exhibits a superlinear increase with the number of poisoning nodes. To mitigate Mobius Injection, we propose a proactive defense mechanism using Agent Component Energy (ACE) Analysis, which detects malicious recursive triggers by measuring anomalous energy in the agent's component graph.
☆ Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Large language models (LLMs) are increasingly deployed on long-horizon tasks in partially observable environments, where they must act while inferring and tracking a complex environment state over many steps. This leads to two challenges: partial observability requires maintaining uncertainty over unobserved world attributes, and long interaction history causes context to grow without bound, diluting task-relevant information. A principled solution to both challenges is a belief state: a posterior distribution over environment states given past observations and actions, which compactly encodes history for decision making regardless of episode length. In LLM agents, however, the open-ended nature of text makes it unclear how to represent such a distribution. Therefore, we introduce Agent-BRACE: Agent Belief state Representation via Abstraction and Confidence Estimation, a method that decouples an LLM agent into a belief state model and a policy model, jointly optimized via reinforcement learning. The belief state model produces a structured approximation of the belief distribution: a set of atomic natural language claims about the environment, each annotated with an ordinal verbalized certainty label ranging from certain to unknown. The policy model conditions on this compact, structured approximate belief rather than the full history, learning to select actions under explicit uncertainty. Across long-horizon, partially observable embodied language environments, Agent-BRACE achieves an average absolute improvement of +14.5% (Qwen2.5-3B-Instruct) and +5.3% (Qwen3-4B-Instruct), outperforming strong RL baselines while maintaining a near-constant context window independent of episode length. Further analysis shows that the learned belief becomes increasingly calibrated over the course of an episode as evidence accumulates.
comment: Code: https://github.com/joykirat18/Agent-BRACE
☆ Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training
Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
☆ MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
Tabular data forms the backbone of high-stakes decision systems in finance, healthcare, and beyond. Yet industrial tabular datasets are inherently difficult: high-dimensional, riddled with missing entries, and rarely labeled at scale. While foundation models have revolutionized vision and language, tabular learning still leans on handcrafted features and lacks a general self-supervised framework. We present MaskTab, a unified pre-training framework designed specifically for industrial-scale tabular data. MaskTab encodes missing values via dedicated learnable tokens, enabling the model to distinguish structural absence from random dropout. It jointly optimizes a hybrid supervised pre-training scheme--utilizing a twin-path architecture to reconcile masked reconstruction with task-specific supervision--and an MoE-augmented loss that adaptively routes features through specialized subnetworks. On industrial-scale benchmarks, it achieves +5.04% AUC and +8.28% KS over prior art under rigorous scaling. Moreover, its representations distill effectively into lightweight models, yielding +2.55% AUC and +4.85% KS under strict latency and interpretability constraints, while improving robustness to distribution shifts. Our work demonstrates that tabular data admits a foundation-model treatment--when its structural idiosyncrasies are respected.
☆ fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model achieves satisfactory results. Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at a moderate accuracy level around 0.5, focusing model training on its learning frontier. We conduct evaluations on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base across six mainstream mathematical reasoning benchmarks. Experimental results demonstrate that FG-ExPO consistently outperforms vanilla GRPO. It delivers an absolute improvement of 13.34 on the AIME 2025 pass@32 metric, rising from 63.33 percent to 76.67 percent, and obtains an average pass@32 gain of 2.66 on the 8B model. The substantially larger performance gains observed on pass@32 compared to pass@1 verify that FG-ExPO enlarges the model's effective exploration space under a fixed inference budget.
☆ AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment NeurIPS 2026
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
comment: 41 pages, 5 figures. Preprint under review for the Track on Evaluations and Datasets at NeurIPS 2026
☆ Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
Humans intuitively solve complex problems by flexibly shifting among reasoning modes: they plan, execute, revise intermediate goals, resolve ambiguity through associative judgment, and apply formal procedures to well-specified subproblems. Current LLM agents lack this flexibility, as their scaffolds hard-code such reasoning decisions in advance. These scaffolds are effective when their prescribed structure matches the task, but brittle when solving the task requires adapting the structure of reasoning itself. We introduce Deep Reasoning -- an inference-time approach for constructing task-specific scaffolds through structured meta-reasoning. Deep Reasoning uses a formal language that represents meta-reasoning as executable decompositions over associative inference, formal computation, and recursive subproblem solving, enabling decomposition principles to be encoded as in-context examples that guide test-time scaffold construction. We instantiate this approach in a general-purpose agent (DOLORES) that distributes complex tasks across more controlled reasoning threads. We evaluate it against state-of-the-art scaffolding methods across four hard benchmarks: multi-hop reasoning, long-chain question answering, long-context aggregation, and deep research-style information seeking. DOLORES outperforms all evaluated scaffolds across three model sizes and two model families, improving over the strongest evaluated scaffold baseline by 24.8% on average. DOLORES distributes cognition across structured, lower-load reasoning threads, thereby reducing premature termination and hallucinations. This advantage can even bridge the scaling gap, with an 8B version surpassing all evaluated 32B baselines from the same family in more than half the settings. These results point toward future agentic systems that treat scaffolding as adaptive reasoning, constructing the structure each task requires just-in-time.
comment: Preprint under review
☆ An Empirical Study of Automating Agent Evaluation
Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
☆ Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Test-time compute is widely believed to benefit only large reasoning models. We show it also helps small embedding models. Most modern embedding checkpoints are distilled from large LLM backbones and inherit their representation space; a frozen embedding model should therefore benefit from extra inference compute without retraining. Using an agentic program-search loop, we explore 259 candidate inference programs over a frozen embedding API across ninety generations. The entire Pareto frontier collapses onto a single algebra: a softmax-weighted centroid of the local top-K documents interpolated with the query. This parameter-free default lifts nDCG@10 statistically significantly across seven embedding-model families spanning a tenfold parameter range, with held-out full-BEIR validation confirming the lift on every model tested.
comment: 37 pages, 5 figures, 16 tables
☆ PresentAgent-2: Towards Generalist Multimodal Presentation Agents
Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.
☆ Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence EMNLP
During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.
comment: Submitted to EMNLP
♻ ☆ Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner ICML 2026
Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.
comment: 29 pages. Accepted to ICML 2026
♻ ☆ jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
comment: 18 pages, 8 figures, 10 tables
♻ ☆ Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: GPT-5--0.93, Llama 3.3 70B Instruct--0.76) and strong temporal ordering (concordance: GPT-5--0.965, Llama 3.3 70B Instruct--0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
comment: Conference on Health, Inference, and Learning (CHIL 2026)
♻ ☆ When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a "good" translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
comment: 10 pages (26 with references and appendices). Accepted at EAMT 2026
♻ ☆ Detecting Data Contamination in LLMs via In-Context Learning
We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
♻ ☆ One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
comment: Project Website: https://turn-gate.github.io/
♻ ☆ SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $κ= 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
♻ ☆ To Err Is Human; To Annotate, SILICON? Toward Robust Reproducibility in LLM Annotation
Unstructured text data annotation is foundational to management research. LLMs offer a cost-effective and scalable alternative to human annotation, but they introduce a novel challenge: the annotator itself can be retired. Proprietary models undergo regular deprecation cycles, threatening long-term reproducibility. Hence, the ability to reproduce annotation results when the original model becomes unavailable, i.e., robust reproducibility, is a central methodological challenge for LLM-based annotation. Achieving robust reproducibility requires first controlling measurement error. We develop an analytical framework that decomposes measurement error into four sources: guideline-induced error from inconsistent annotation criteria, baseline-induced error from unreliable human references, prompt-induced error from suboptimal meta-instruction, and model-induced error from architectural differences across LLMs. We develop the SILICON workflow that instantiates the analytical framework, prescribing targeted interventions at each error source. Empirical validation across nine management research tasks confirms that these interventions reduce measurement error, and simulations show that the resulting error reduction yields more accurate downstream statistical estimates. With measurement error controlled, we address two further aspects of robust reproducibility. First, we propose a regression-based methodology to establish backup open-weight models, which are permanently accessible. Every tested task has at least one open-weight model with no statistically detectable performance difference. Second, we quantify the upper bound of annotation quality attainable from the current set of available models by proposing a routing procedure that selectively sends low-confidence items to auxiliary models, revealing when model aggregation improves performance and when that may adversely affect labeling quality.
♻ ☆ Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis
In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an `ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.
comment: Accepted FAccT 2026; 29 pages, 14 tables, 7 figures. Previous title - Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments; NLLPW 2026
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ MUR: Momentum Uncertainty guided Reasoning
Current models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide current model' test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by by over 45% on average while improving accuracy from 0.33 to 3.46%.
♻ ☆ Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose \textbf{S}elf-\textbf{Co}nsolidating \textbf{L}anguage Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
comment: 9 pages
♻ ☆ Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG) ACL 2026
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. To preserve essential information for downstream RAG tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate text that mimics subsampled database records in a DP manner. Experiments show that DP-SynRAG achieves superior performance to the state-of-the-art private RAG systems while maintaining a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.
comment: Accepted to ACL 2026 Findings
♻ ☆ Invisible failures in human-AI interactions
AI systems fail silently far more often than they fail visibly. In an analysis of 100K human-AI interactions from the WildChat dataset, we find that 79% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also created and annotated a counterfactual dataset in which WildChat's 2024-era responses are replaced by those from three present-day frontier LMs. This analysis indicates that failure rates have dropped substantially, but that the vast majority of failures remain invisible in our sense, and the distribution of failure archetypes seems stable. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at https://github.com/bigspinai/bigspin-invisible-failure-archetypes
♻ ☆ Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment IJCAI 2026
Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into a target language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of $0.86$ BLEU and $0.93$ COMET over SeamlessM4T, with maximum improvements of $1.49$ BLEU and $1.41$ COMET across different test sets.
comment: Accepted to IJCAI 2026 Main Track
♻ ☆ Where is the Mind? Persona Vectors and LLM Individuation
The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.
♻ ☆ DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints are all available at https://github.com/thunlp/DECO.
comment: 14 pages, 11 figures, 11 tables
♻ ☆ Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents ACL 2026
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization to the GSA-R2R, a benchmark with novel instruction styles and unseen environments.
comment: Accepted by ACL 2026 Main Conference
♻ ☆ A Survey of On-Policy Distillation for Large Language Models
As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as $f$-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained RL. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agentic distillation, and the growing overlap between knowledge distillation and RL.
comment: Ongoing Work
♻ ☆ Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection
Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
♻ ☆ GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
Pairwise LLM-as-a-judge evaluation asks the judge to identify the \emph{better} of two candidate answers. We study a one-line modification that asks for the \emph{worse} answer instead and recovers the preference by elimination, a procedure we call Goal-Reversed Prompting (GRP). GRP introduces no extra inference rounds, composes with any prompt template (direct, chain-of-thought, or Arena-Hard SOP), and leaves the rest of the evaluation pipeline untouched. Two observations motivate the reversal. Reverse reasoning is a recurring strategy in human problem solving, and modern instruction-tuned judges exhibit a positive-leaning bias that asking for the worse answer can counteract. On JudgeBench under a strict consistency protocol that counts a judgment as correct only when both response orderings agree with the gold preference, GRP improves all three closed-source judges we test across both response-pair sources. With GPT-4o-generated pairs, the Arena-Hard SOP baseline improves from 61.71\% to 66.23\% for GPT-4o (+4.52) and from 60.00\% to 66.00\% for Claude-3.5-Sonnet (+6.00), with the largest absolute gains on Reasoning and Mathematics. The lift persists when response pairs come from Claude-3.5-Sonnet and when the SOP scaffolding is stripped to a minimal direct-prompting template, suggesting that goal reversal acts on the underlying judging behavior rather than on a particular rubric. Stronger judges benefit more than weaker ones, suggesting that goal reversal exposes additional reasoning capacity rather than compensating for its absence.
comment: Ongoing Work
♻ ☆ Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
comment: Accepted by NLP4DH 2026
♻ ☆ A Formal Comparison Between Chain of Thought and Latent Thought ICML 2026
Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.
comment: Camera-ready version for ICML 2026
♻ ☆ The Challenge and Reward of Fair Play in Narrative: A Computational Approach
Good storytelling involves surprise -- unpredictability in how the story unfolds -- and sense-making, the requirement that the story forms a coherent sequence. However, to date, these two qualities have largely been addressed in isolation. We formalize these qualities and their relationship in an information-theoretic framework, using detective fiction as a paradigm case of narratives in which a hidden truth is discovered through reasoning. Our central theoretical result shows that surprise and coherence must trade off for any *single* reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of *fair play*, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play. Experiments on LLM-generated stories validate our theoretical predictions: while models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models. Moreover, surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality. A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers. Our metrics also reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.
comment: 47 pages, 11 figures, 13 tables
♻ ☆ Hallucination Detection in LLMs with Topological Divergence on Attention Graphs ACL 2026
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.
comment: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
♻ ☆ Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models ICML 2026
Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.
comment: Accepted at ICML 2026
♻ ☆ Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
comment: 28 pages, 13 figures
♻ ☆ ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models ICML 2026
A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches use Large Language Models (LLMs) to generate explanatory factors and coarse-grained probability estimates, which are then refined by a Naïve Bayes model over factor combinations. However, sparse factor spaces often yield ``unknown'' predictions, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose \textsc{Anchor}, an aggregated Bayesian inference framework over a hierarchical factor space. It constructs dense factor hierarchies through iterative generation and clustering, maps contexts via hierarchical retrieval and refinement, and augments Naïve Bayes with a Causal Bayesian Network to model latent factor dependencies. Experiments show that \textsc{Anchor} markedly reduces ``unknown'' predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.
comment: Accepted by ICML 2026
♻ ☆ Modality-Inconsistent Continual Learning of Multimodal Large Language Models
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
comment: Accepted at Transactions on Machine Learning Research (TMLR), 2026
♻ ☆ StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1{,}500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. \textbf{(i)} Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. \textbf{(ii)} Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. \textbf{(iii)} Human and LLM harmfulness judgments are broadly aligned (Spearman $ρ=0.62$), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.
comment: Preprint
♻ ☆ GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.
comment: Fixed typos in Eq. 4 and GPU names; added details on hybrid paged attention implementation
♻ ☆ Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs ICML 2026
Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position advocates for second-order prediction -- incorporating probabilities explicitly as part of the output -- as a theoretically sound method, in contrast to using token logprobs. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.
comment: Accepted to ICML 2026 (position track)
♻ ☆ MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
As LLM-powered agents are increasingly deployed in edge-cloud environments, personalized memory has become a key enabler of long-term adaptation and user-centric interaction. However, cloud-assisted memory management exposes sensitive user information, while existing privacy protection methods typically rely on aggressive masking that removes task-relevant semantics and consequently degrades memory utility and personalization quality. To address this challenge, We propose MemPrivacy, which identifies privacy-sensitive spans on edge devices, replaces them with semantically structured type-aware placeholders for cloud-side memory processing, and restores the original values locally when needed. By decoupling privacy protection from semantic destruction, MemPrivacy minimizes sensitive data exposure while retaining the information required for effective memory formation and retrieval. We also construct MemPrivacy-Bench for systematic evaluation, a dataset covering 200 users and over 52k privacy instances, and introduce a four-level privacy taxonomy for configurable protection policies. Experiments show that MemPrivacy achieves strong performance in privacy information extraction, substantially surpassing strong general-purpose models such as GPT-5.2 and Gemini-3.1-Pro, while also reducing inference latency. Across multiple widely used memory systems, MemPrivacy limits utility loss to within 1.6%, outperforming baseline masking strategies. Overall, MemPrivacy offers an effective balance between privacy protection and personalized memory utility for edge-cloud agents, enabling secure, practical, and user-transparent deployment.
♻ ☆ Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
comment: Due to issues found with the annotations in Section 4.3, we have decided to withdraw this preprint
♻ ☆ MajinBook: An open catalogue of digitally mediated world literature
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries-such as Library Genesis and Z-Library-for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritises natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
comment: 9 pages, 5 figures, 1 table
♻ ☆ Diffusion-State Policy Optimization for Masked Diffusion Language Models
Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo .
♻ ☆ A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities
Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.
♻ ☆ Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines EACL 2026
We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
comment: EACL 2026 Industry
♻ ☆ HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench ICML 2026
SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). We validate our approach on models with up to 560B parameters across different context windows (32K/128K). This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.
comment: Accepted at ICML 2026. 21 pages, 15 figures
♻ ☆ Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion AAAI 2026
Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
comment: Published at AAAI 2026
♻ ☆ Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression ICML 2026
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
comment: ICML 2026
♻ ☆ Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding ICLR 2026
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our project page is available at https://daioba.github.io/surelock .
comment: Accepted to ICLR 2026
♻ ☆ A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse
Autoregressive language models (ARMs) suffer from the reversal curse: after learning ''$A$ is $B$,'' they often fail on the reverse query ''$B$ is $A$.'' Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing ''$[\mathbf{M}]$ is $B$'' during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt ''$B$ is $[\mathbf{M}]$.'' We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through queries and keys without changing the value-side evidence being retrieved. In a one-layer MDM, we prove that forward masked training strengthens evidence that is reusable in reverse queries, induces correlated forward--reverse attention routes, and yields a positively aligned shared-storage gradient component that decreases the reverse loss to first order. Controlled one-layer experiments and large-scale LLaDA/Dream experiments verify these signatures and show that they translate into improved reverse prediction.
♻ ☆ AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor ACL 2026
We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.
comment: ACL 2026 Findings
♻ ☆ RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
♻ ☆ Phase Transitions in Affective Meaning Divergence: The Hidden Drift Before the Break ACL 2026
One partner says "Fine" meaning "resolution"; the other hears "surrender." The word is shared; the affective uptake is not. We formalize this as affective meaning divergence (AMD), the total-variation distance between interlocutors' anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a saddle-node bifurcation: when $βα> 4$, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; $N = 652$), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance ($p < 0.001$, $d = 0.36$), AMD variance ($p = 0.001$, $d = 0.26$), and dialog-act repair variance ($p = 0.016$, $d = 0.20$), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV ($N = 1,169$) yields mixed but directionally consistent evidence.
comment: Accepted to the ACL 2026 Student Research Workshop
♻ ☆ OASIS: A Multilingual and Multimodal Dataset for Culturally Grounded Spoken Visual QA
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information, everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech, and 20K hours of voice-cloned speech, from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS
comment: Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Contextual Understanding, Culturally Informed
♻ ☆ Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM ASPLOS '26
Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%--31% and energy consumption by 19%--27%. Under a KV cache budget of 1024, it achieves up to 54%--74% latency reduction and 45%--67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.
comment: Early preprint; peer-reviewed version of record published in ASPLOS '26
♻ ☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
♻ ☆ Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors
Large language models (LLMs) achieve high accuracy on many reasoning benchmarks but remain brittle under structural perturbations of rule-based systems. We introduce a diagnostic framework with four stress tests -- redundant vs. essential rule deletion, contradictory-rule injection, logic-preserving rewrites, and multi-law stacking -- and use it to expose Logic Inertia: the tendency of generative LLMs (Qwen2/3, TinyLlama, GPT-4o, Gemma-3-4B-IT) and the encoder-only BERT baseline to persist along learned deductive trajectories under inconsistent premises. The collapse is sharp: untreated baselines fall from accuracy 1.00 on the base task to 0.00 on contradiction injection (instance-level exact match), and GPT-4o resolves only 56.0% of contradiction cases. We propose Conflict-Aware Fusion, a four-stage training pipeline that enforces verification-before-deduction as a learned structural prior: (i) SFT establishes the verification preamble; (ii) DPO sharpens the halt-on-contradiction decision boundary; (iii) Logical Invariance REgularisation (LIRE) penalises divergence between logically equivalent rule formulations via symmetric KL; (iv) Reinforcement Learning from Verification Feedback (RLVF) uses a symbolic forward-chaining engine as a deterministic oracle reward, jointly optimising invariance and sensitivity. The pipeline saturates all four primary stress tests for both 1.5B and 8B backbones. We further validate a Phase 2 extension that replaces the propositional oracle with a Lean 4 kernel, attaining 99.0% kernel agreement on the 105 classically-derivable (T) questions within a stratified 187-question Lean-translated sample (overall 71.7% across both polarities), providing a sound upgrade path to formally verified RL training. Code and benchmark: https://github.com/14H034160212/lemo
♻ ☆ STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($ρ_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($ρ_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
♻ ☆ RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
Large Language Models (LLMs) face severe safety risks from jailbreak attacks, yet current safety testing largely relies on static datasets and lacks systematic criteria to evaluate test suite quality and adequacy. While coverage criteria have proven effective for smaller neural networks, they are impractical for LLMs due to computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations. To address these issues, we propose RACC (Representation-Aware Coverage Criteria), a set of coverage criteria specialized for LLM safety testing. RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs, which is a key distinction that neuron-level criteria fail to make. We further demonstrate RACC's practical value in two applications, including test suite prioritization and attack prompt sampling, and validate its generalization across diverse settings and configurations. Overall, RACC provides a scalable and principled foundation for coverage-guided LLM safety testing.
♻ ☆ READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling AAAI 2024
Fully fine-tuning pretrained large-scale transformer models has become a popular paradigm for video-language modeling tasks, such as temporal language grounding and video-language summarization. With a growing number of tasks and limited training data, such full fine-tuning approach leads to costly model storage and unstable training. To overcome these shortcomings, we introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time. However, existing adapters fail to capture intrinsic temporal relations among video frames or textual words. Moreover, they neglect the preservation of critical task-related information that flows from the raw video-language input into the adapter's low-dimensional space. To address these issues, we first propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability. Second, we propose Partial Video-Language Alignment (PVLA) objective via the use of partial optimal transport to maintain task-related information flowing into our READ modules. We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies on multiple low-resource temporal language grounding and video-language summarization benchmarks. The code, model, and data have been made available at https://nguyentthong.github.io/READ.
comment: Accepted at AAAI 2024
♻ ☆ LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width--depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data, and code will be open-source at https://github.com/zhengkid/AutoTTS.
comment: 25 pages
♻ ☆ Gradient-Boosted Decision Tree for Listwise Context Model in Multimodal Review Helpfulness Prediction ACL 2023
Multimodal Review Helpfulness Prediction (MRHP) aims to rank product reviews based on predicted helpfulness scores and has been widely applied in e-commerce via presenting customers with useful reviews. Previous studies commonly employ fully-connected neural networks (FCNNs) as the final score predictor and pairwise loss as the training objective. However, FCNNs have been shown to perform inefficient splitting for review features, making the model difficult to clearly differentiate helpful from unhelpful reviews. Furthermore, pairwise objective, which works on review pairs, may not completely capture the MRHP goal to produce the ranking for the entire review list, and possibly induces low generalization during testing. To address these issues, we propose a listwise attention network that clearly captures the MRHP ranking context and a listwise optimization objective that enhances model generalization. We further propose gradient-boosted decision tree as the score predictor to efficaciously partition product reviews' representations. Extensive experiments demonstrate that our method achieves state-of-the-art results and polished generalization performance on two large-scale MRHP benchmark datasets.
comment: Published in ACL 2023 (Findings). Code is available at https://github.com/nguyentthong/gbdt_listwise_mrhp
♻ ☆ Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Standard Retrieval Augmented Generation (RAG) is poorly matched to agent memory. Unlike large heterogeneous corpora, agent memory forms a bounded and coherent interaction stream in which many spans are highly correlated or near duplicates. As a result, flat top-$k$ similarity retrieval often returns redundant context, while summary-centric hierarchies can blur the subtle details that distinguish one candidate from another. We argue that agent memory should follow the principle of decoupling before aggregation: the system should first isolate reusable facts, updates, and distinguishing details from similar histories, and only then organise them for efficient retrieval. Based on this principle, we propose xMemory, which constructs a revisable hierarchical memory structure from original messages to segments, memory components, and groups. xMemory segments interaction history into local events, decouples each segment into memory components, aggregates related components into high-level groups using a sparsity--semantic faithfulness objective, and maintains this structure incrementally as memory evolves. At inference time, xMemory retrieves top-down, first selecting a compact backbone of complementary groups and components, and then expanding to segments and raw messages only when additional evidence reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across diverse open source and closed source LLMs show consistent gains in answer quality and inference token efficiency, supported by analyses of redundancy, evidence density, and coverage.
comment: Project Address: https://zhanghao-xmemory.github.io/Academic-project-page-template/; Code Address: https://github.com/HU-xiaobai/xMemory
♻ ☆ Learning Adapter Rank via Symmetry Breaking
Low-rank adaptation is effective partly because downstream updates lie in a low-dimensional subspace, but the latent rank coordinates of LoRA are not identifiable: any invertible reparameterization of the adapter factors leaves the weight update unchanged. We show that variational inference with a diagonal rank-wise posterior turns this non-identifiability into a useful inductive bias. By breaking LoRA's rotational gauge symmetry, the variational objective selects a preferred basis in rank space, enabling automatic relevance determination over rank directions. This yields Low-Rank Variational Dropout (LRVD), a Bayesian framework that performs inference directly in the low-rank adaptation space rather than the ambient weight space. As an instantiation, BayesLoRA jointly learns effective adapter rank and predictive uncertainty with only $\mathcal{O}(r)$ additional parameters. Empirically, BayesLoRA induces stable rank structure aligned with the dominant singular directions of learned updates, yields compact predictive calibration and matches or exceeds strong low-rank sparsification baselines at comparable training cost.
comment: 8 pages, 2 figures, 4 tables
♻ ☆ Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives ACL 2024
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research.
comment: Accepted at ACL 2024 (Findings). Code is available at https://github.com/nguyentthong/video-language-understanding
♻ ☆ PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
♻ ☆ Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions EMNLP 2022
Modern Review Helpfulness Prediction systems are dependent upon multiple modalities, typically texts and images. Unfortunately, those contemporary approaches pay scarce attention to polish representations of cross-modal relations and tend to suffer from inferior optimization. This might cause harm to model's predictions in numerous cases. To overcome the aforementioned issues, we propose Multimodal Contrastive Learning for Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on mutual information between input modalities to explicitly elaborate cross-modal relations. In addition, we introduce Adaptive Weighting scheme for our contrastive learning approach in order to increase flexibility in optimization. Lastly, we propose Multimodal Interaction module to address the unalignment nature of multimodal data, thereby assisting the model in producing more reasonable multimodal representations. Experimental results show that our method outperforms prior baselines and achieves state-of-the-art results on two publicly available benchmark datasets for MRHP problem.
comment: Accepted to the main EMNLP 2022 conference. Code is available at https://github.com/nguyentthong/adaptive_contrastive_mrhp
♻ ☆ DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding EMNLP 2023
Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
comment: Accepted at EMNLP 2023 (Findings). Code is available at https://github.com/nguyentthong/demaformer
♻ ☆ CktFormalizer: Autoformalization of Natural Language into Circuit Representations
LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker:dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall:compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant:the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95--100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proof ensuring that each optimized variant remains functionally equivalent to its formal specification.
♻ ☆ Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
Safety post-training can improve the harmfulness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the \emph{alignment tax}. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose \textbf{O}rthogonal \textbf{G}radient \textbf{P}rojection for \textbf{S}afety \textbf{A}lignment (\textbf{OGPSA}), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA improves the observed safety--utility trade-off over standard baselines. Under the sequential SFT$\rightarrow$DPO pipeline, the average performance gain increases from 33.98\% to 42.74\% on Qwen2.5-7B-Instruct and from 19.74\% to 32.98\% on Llama3.1-8B-Instruct. We have open sourced our code at https://github.com/SunGL001/OGPSA.
♻ ☆ 100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks -- three-way polarity classification and five-class score classification -- and benchmark classical BoW/TF-IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
comment: 10 pages, 1 figure, 8 tables, to appear in Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)
♻ ☆ RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text--image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.
comment: Code and dataset will be released at https://github.com/xudanni0927/AgentFact
♻ ☆ Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of large language models (LLMs), but it often suffers from \textit{restricted exploration}, where the policy rapidly concentrates on a narrow set of solutions. A common remedy is entropy regularization, which attempts to preserve exploration by increasing policy entropy. However, for LLM-RL, this intervention is highly sensitive to its coefficient, can introduce semantically weak uncertainty, and often yields limited accuracy gains. This motivates a more precise question: which entropy helps reasoning, and which entropy should be reduced? To study this, we parameterize the advantage estimator in Group Relative Policy Optimization (GRPO) into positive and negative outcome-conditioned channels and analyze their entropy dynamics. Our results show that positive-channel modulation raises \textit{productive entropy} associated with successful reasoning trajectories, while negative-channel modulation removes \textit{noisy entropy} associated with failed rollouts and reduces interference with correct paths. Guided by this channel-wise view, we propose \textbf{AsymGRPO}, which decouples the modulation strengths of positive and negative advantages. This enables flexible control over how the model updates across prompt difficulty levels, allowing stronger reinforcement of rare successes on harder prompts or stronger suppression of residual failures on easier prompts without forcing the two channels to share the same modulation strength. Experiments on five mathematical reasoning benchmarks show that AsymGRPO outperforms strong RLVR baselines, with consistent gains across model backbones.
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 3 popular agent harnesses and 5 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 45.1%.
comment: 29 pages, 16 figures
♻ ☆ Enriching and Controlling Global Semantics for Text Summarization EMNLP 2021
Recently, Transformer-based models have been proven effective in the abstractive summarization task by creating fluent and informative summaries. Nevertheless, these models still suffer from the short-range dependency problem, causing them to produce summaries that miss the key points of document. In this paper, we attempt to address this issue by introducing a neural topic model empowered with normalizing flow to capture the global semantics of the document, which are then integrated into the summarization model. In addition, to avoid the overwhelming effect of global semantics on contextualized representation, we introduce a mechanism to control the amount of global semantics supplied to the text generation module. Our method outperforms state-of-the-art summarization models on five common text summarization datasets, namely CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed.
comment: Accepted to the main EMNLP 2021 conference. Code is available at https://github.com/nguyentthong/topicflow-sum
♻ ☆ Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson's paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs' responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
♻ ☆ Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants ICASSP 2026
Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistant task often have outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, to make WalkVLM-LR assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and less temporal redundancy.
comment: ICASSP 2026 Best Industry Paper
♻ ☆ Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation ICLR 2026
Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
comment: OpenReview (ICLR 2026): https://openreview.net/forum?id=PTXi3Ef4sT
♻ ☆ Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO ICML
We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback~(RLHF) and direct preference optimization~(DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
comment: ICML accepted version
♻ ☆ Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
comment: Under review, 7 figures, 12 tables
♻ ☆ MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models ICML 2026
Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
comment: Accepted to ICML 2026
Machine Learning 300
☆ AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward ICML2026
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
comment: ICML2026
☆ Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
comment: Technical report v1 (30 pages, 19 figures, project page: https://spherelab.ai/pion/)
☆ Elastic Attention Cores for Scalable Vision Transformers
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
comment: Project repository here: https://github.com/alansong1322/VECA
☆ Task-Adaptive Embedding Refinement via Test-time LLM Guidance
We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at https://github.com/IBM/task-aware-embedding-refinement.
☆ Learning, Fast and Slow: Towards LLMs That Adapt Continually
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
☆ Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated carefully. The standard practice is to use this data directly on the model that will be deployed, for example by running GRPO on the deployment student. We argue that this is often an inefficient allocation because it overlooks a reward-density principle: sparse sequence-level reward should train models where exploration is productive, while dense token-level teacher reward should be used where the aim is to compress behavior into a smaller model. In this view, GRPO-style sparse RL and OPD-style dense teacher supervision are not separate recipes; they are different reward-density regimes. The allocation rule is simple: use scarce labeled training data upstream on the strongest model that can turn it into reward-shaped behavior, then transfer that behavior downstream as dense supervision. We evaluate this rule on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student, while transfer from the same teacher before RL underperforms. The bridge is important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts is consistently strongest on MATH before any post-bridge student-side sparse RL, and also gives the best pre-Stage~3 AIME endpoints for the canonical 8B/14B teachers. The bridge also makes later student-side sparse RL effective: GRPO that is weak on a cold student lifts MATH from $75.4\%$ to $78.5\%$ after the bridge and outperforms a matched replay control by $2.8$ points. The operational principal is to avoid using scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.
☆ MEME: Multi-entity & Evolving Memory Evaluation
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
☆ Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
☆ KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
comment: 12 pages, 3 figures, 6 tables
☆ Solve the Loop: Attractor Models for Language and Reasoning
Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3, fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
☆ High-arity Sample Compression
Recently, a series of works have started studying variations of concepts from learning theory for product spaces, which can be collected under the name high-arity learning theory. In this work, we consider a high-arity variant of sample compression schemes and we prove that the existence of a high-arity sample compression scheme of non-trivial quality implies high-arity PAC learnability.
comment: 29 pages
☆ Search Your Block Floating Point Scales!
Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by upto 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
☆ Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs
Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
☆ A proximal gradient algorithm for composite log-concave sampling
We propose an algorithm to sample from composite log-concave distributions over $\mathbb{R}^d$, i.e., densities of the form $π\propto e^{-f-g}$, assuming access to gradient evaluations of $f$ and a restricted Gaussian oracle (RGO) for $g$. The latter requirement means that we can easily sample from the density $\text{RGO}_{g,h,y}(x) \propto \exp(-g(x) -\frac{1}{2h}||y-x||^2)$, which is the sampling analogue of the proximal operator for $g$. If $f + g$ is $α$-strongly convex and $f$ is $β$-smooth, our sampler achieves $\varepsilon$ error in total variation distance in $\widetilde{\mathcal O}(κ\sqrt d \log^4(1/\varepsilon))$ iterations where $κ:= β/α$, which matches prior state-of-the-art results for the case $g=0$. We further extend our results to cases where (1) $π$ is non-log-concave but satisfies a Poincaré or log-Sobolev inequality, and (2) $f$ is non-smooth but Lipschitz.
☆ Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
comment: Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
☆ TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance; while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also ``radioactive'': its watermark signal transfers through model distillation, enabling detection of unauthorized use.
☆ Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance
To address the issues of high interruption time and measurement report overhead under user equipment (UE) mobility especially in high speed 5G use cases the use of AI/ML techniques (AI/ML beam management and mobility procedures) have been proposed. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. Therefore, there is an utmost need for realistic datasets under various conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to depict real time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution. To support this research, the dataset includes timing advance (TA) measurements at various signaling events such as RACH trigger, MAC CE, and PDCCH grant which are typically missing in existing works. We cover a detailed description of the creation of the dataset; experimental setup, data acquisition, and extraction. We also cover an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of the inference of the AI/ML model. One such use case is to train and evaluate various AI/ML models for TA prediction.
☆ ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models
Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question--answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
comment: 18 pages, 2 figures
☆ Environment-Adaptive Preference Optimization for Wildfire Prediction
Predicting rare extreme events such as wildfires from meteorological data requires models that remain reliable under evolving environmental conditions. This problem is inherently long-tailed: wildfire events are rare but high-impact, while most observations correspond to non-fire conditions, causing standard learning objectives to underemphasize the minority class (fire) that matters most. In addition, models trained on historical distributions often fail under distribution shifts, exhibiting degraded performance in new environments. To this end, we propose Environment-Adaptive Preference Optimization (EAPO), a framework that adapts prediction to the target environment with long-tail distribution. Given a new input distribution, we first construct distribution-aligned datasets via $k$-nearest neighbor retrieval. We then perform a hybrid fine-tuning procedure on this local manifold, combining supervised learning with preference optimization, as well as emphasizing on rare extreme events. EAPO refines decision boundaries while avoiding conflicting signals from heterogeneous training data. We evaluate EAPO on a real-world wildfire prediction task with environmental shifts. EAPO achieves robust performance (ROC-AUC 0.7310) and improves detection in extreme regimes, demonstrating its effectiveness in dynamic wildfire prediction systems.
☆ Learning Minimally Rigid Graphs with High Realization Counts IJCAI 2026
For minimally rigid graphs, the same edge-length data can admit multiple realizations (up to translations and rotations). Finding graphs with exceptionally many realizations is an extremal problem in rigidity theory, but exhaustive search quickly becomes infeasible due to the super-exponential growth of the number of candidate graphs and the high cost of realization-count evaluation. We propose a reinforcement-learning approach that constructs minimally rigid graphs via 0- and 1-extensions, also known as Henneberg moves. We optimize realization-count invariants using the Deep Cross-Entropy Method with a policy parameterized by a Graph Isomorphism Network encoder and a permutation-equivariant extension-level action head. Empirically, our method matches the known optima for planar realization counts and improves the best known bounds for spherical realization counts, yielding new record graphs.
comment: This is an extended version of the paper accepted to IJCAI 2026
☆ ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
Despite the rapid advancements in large language model (LLM) development, fine-tuning them for specific tasks often results in the catastrophic forgetting of their general, language-based reasoning abilities. This work investigates and addresses this challenge in the context of the Generative Retrieval (GenRetrieval) task. During GenRetrieval fine-tuning, we find this forgetting occurs rapidly and correlates with the distance between the fine-tuned and original model parameters. Given these observations, we propose ORBIT, a novel approach that actively tracks the distance between fine-tuned and initial model weights, and uses a weight averaging strategy to constrain model drift during GenRetrieval fine-tuning when this inter-model distance exceeds a maximum threshold. Our results show that ORBIT retains substantial text and retrieval performance by outperforming both common continual learning baselines and related regularization methods that also employ weight averaging.
☆ Aligning Flow Map Policies with Optimal Q-Guidance
Generative policies based on expressive model classes, such as diffusion and flow matching, are well-suited to complex control problems with highly multimodal action distributions. Their expressivity, however, comes at a significant inference cost: generating each action typically requires simulating many steps of the generative process, compounding latency across sequential decision-making rollouts. We introduce flow map policies, a novel class of generative policies designed for fast action generation by learning to take arbitrary-size jumps including one-step jumps-across the generative dynamics of existing flow-based policies. We instantiate flow map policies for offline-to-online reinforcement learning (RL) and formulate online adaptation as a trust-region optimization problem that improves the critic's Q-value while remaining close to the offline policy. We theoretically derive FLOW MAP Q-GUIDANCE (FMQ), a principled closed-form learning target that is optimal for adapting offline flow map policies under a critic-guided trust-region constraint. We further introduce Q-GUIDED BEAM SEARCH (QGBS), a stochastic flow-map sampler that combines renoising with beam search to enable iterative inference-time refinement. Across 12 challenging robotic manipulation and locomotion tasks from OGBench and RoboMimic, FMQ achieves state-of-the-art performance in offline-to-online RL, outperforming the previous one-step policy MVP by a relative improvement of 21.3% on the average success rate.
☆ Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
Large Language Models (LLMs) update their behavior in context, which can be viewed as a form of Bayesian inference. However, the structure of the latent hypothesis space over which this inference operates remains unclear. In this work, we propose that LLMs assign beliefs over a low-dimensional geometric space - a conceptual belief space - and that in-context learning corresponds to a trajectory through this space as beliefs are updated over time. Using story understanding as a natural setting for dynamic belief updating, we combine behavioral and representational analyses to study these trajectories. We find that (1) belief updates are well-described as trajectories on low-dimensional, structured manifolds; (2) this structure is reflected consistently in both model behavior and internal representations and can be decoded with simple linear probes to predict behavior; and (3) interventions on these representations causally steer belief trajectories, with effects that can be predicted from the geometry of the conceptual space. Together, our results provide a geometric account of belief dynamics in LLMs, grounding Bayesian interpretations of in-context learning in structured conceptual representations.
☆ Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
AI agents negotiate and transact in natural language with unfamiliar counterparts: a buyer bot facing an unknown seller, or a procurement assistant negotiating with a supplier. In such interactions, the counterpart's LLM, prompts, control logic, and rule-based fallbacks are hidden, while each decision can have monetary consequences. We ask whether an agent can predict an unfamiliar counterpart's next decision from a few interactions. To avoid real-world logging confounds, we study this problem in controlled bargaining and negotiation games, formulating it as target-adaptive text-tabular prediction: each decision point is a table row combining structured game state, offer history, and dialogue, while $K$ previous games of the same target agent, i.e., the counterpart being modeled, are provided in the prompt as labeled adaptation examples. Our model is built on a tabular foundation model that represents rows using game-state features and LLM-based text representations, and adds LLM-as-Observer as an additional representation: a small frozen LLM reads the decision-time state and dialogue; its answer is discarded, and its hidden state becomes a decision-oriented feature, making the LLM an encoder rather than a direct few-shot predictor. Training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents, the full model outperforms direct LLM-as-Predictor prompting and game+text features baselines. Within this tabular model, Observer features contribute beyond the other feature schemes: at $K=16$, they improve response-prediction AUC by about 4 points across both tasks and reduce bargaining offer-prediction error by 14%. These results show that formulating counterpart prediction as a target-adaptive text-tabular task enables effective adaptation, and that hidden LLM representations expose decision-relevant signals that direct prompting does not surface.
☆ Model-based Bootstrap of Controlled Markov Chains
We propose and analyze a model-based bootstrap for transition kernels in finite controlled Markov chains (CMCs) with possibly nonstationary or history-dependent control policies, a setting that arises naturally in offline reinforcement learning (RL) when the behavior policy generating the data is unknown. We establish distributional consistency of the bootstrap transition estimator in both a single long-chain regime and the episodic offline RL regime. The key technical tools are a novel bootstrap law of large numbers (LLN) for the visitation counts and a novel use of the martingale central limit theorem (CLT) for the bootstrap transition increments. We extend bootstrap distributional consistency to the downstream targets of offline policy evaluation (OPE) and optimal policy recovery (OPR) via the delta method by verifying Hadamard differentiability of the Bellman operators, yielding asymptotically valid confidence intervals for value and $Q$-functions. Experiments on the RiverSwim problem show that the proposed bootstrap confidence intervals (CIs), especially the percentile CIs, outperform the episodic bootstrap and plug-in CLT CIs, and are often close to nominal ($50\%$, $90\%$, $95\%$) coverage, while the baselines are poorly calibrated at small sample sizes and short episode lengths.
comment: 45 pages, 7 figures, 19 tables
☆ OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
We study {on-policy self-distillation} (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose \methodname, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, \methodname stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
comment: 12 pages, 7 figures, 3 tables
☆ Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Training Neural Networks (NNs) without overfitting is difficult; detecting that overfitting is difficult as well. We present a novel Random Matrix Theory method that detects the onset of overfitting in deep learning models without access to train or test data. For each model layer, we randomize each weight matrix element-wise, $\mathbf{W} \to \mathbf{W}_{\mathrm{rand}}$, fit the randomized empirical spectral distribution with a Marchenko-Pastur distribution, and identify large outliers that violate self-averaging. We call these outliers Correlation Traps. During the onset of overfitting, which we call the "anti-grokking" phase in long-horizon grokking, Correlation Traps form and grow in number and scale as test accuracy decreases while train accuracy remains high. Traps may be benign or may harm generalization; we provide an empirical approach to distinguish between them by passing random data through the trained model and evaluating the JS divergence of output logits. Our findings show that anti-grokking is an additional grokking phase with high train accuracy and decreasing test accuracy, structurally distinct from pre-grokking through its Correlation Traps. More broadly, we find that some foundation-scale LLMs exhibit the same Correlation Traps, indicating potentially harmful overfitting.
comment: 24 pages, 24 figures
☆ Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning
We present a novel method for extracting moving objects from TESS data using machine learning. Our approach uses two stacked 3D U-Nets with skip connections, which we call a W-Net, to filter background and identify pixels containing moving objects in TESS image time-series data. By augmenting the training data through rotation of the image cubes, our method is robust to differences in speed and direction of asteroids, requiring no assumptions for either parameter range which are typically required in "shift-and-stack" type algorithms. We also developed a novel method for learned data scaling that we call Adaptive Normalization, which allows the neural network to learn the ideal range and scaling distribution required for optimal data processing. We built a code for creating TESS training data with asteroid masks that served as the foundation of our effort (tess-asteroid-ml), which we publicly released for the benefit of the community. Our method is not limited to TESS, but applicable for implementation in other similar time-domain surveys, making it of particular interest for use with data from upcoming missions such as the Nancy Grace Roman Space Telescope and NEOSurveyor.
comment: Accepted by The Astronomical Journal, 11 May 2026
☆ SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation ICML 2026
Segmenting small and sparse structures in large-scale images is fundamentally constrained by voxel-level, lattice-bound computation and extreme class imbalance -- dense, full-resolution inference scales poorly and forces most pipelines to rely on fixed regionization or downsampling, coupling computational cost to image resolution and attenuating boundary evidence precisely where minority structures are most informative. We introduce SEMIR (Semantic Minor-Induced Representation Learning), a representation framework that decouples inference from the native grid by learning a task-adapted, topology-preserving latent graph representation with exact decoding. SEMIR transforms the underlying grid graph into a compact, boundary-aligned graph minor through parameterized edge contraction, node deletion, and edge deletion, while preserving an exact lifting map from minor predictions to lattice labels. Minor construction is formalized as a few-shot structure learning problem that replaces hand-tuned preprocessing with a boundary-alignment objective: minor parameters are learned by maximizing agreement between predicted boundary elements and target-specific semantic edges under a boundary Dice criterion, and the induced minor is annotated with scale- and rotation-robust geometric and intensity descriptors and supports efficient region-level inference via message passing on a graph neural network (GNN) with relational edge features. We benchmark SEMIR on three tumor segmentation datasets -- BraTS 2021, KiTS23, and LiTS -- where targets exhibit high structural variability and distributional uncertainty. SEMIR yields consistent improvements in minority-structure Dice at practical runtime. More broadly, SEMIR establishes a framework for learning task-adapted, topology-preserving latent representations with exact decoding for high-resolution structured visual data.
comment: 20 pages, 3 figures. Accepted at ICML 2026. Includes appendices
☆ Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning
Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve-and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind fixed behaviors to fixed agent identities. Consequently, they are ill-equipped for tasks where agents need to take on different roles at very specific moments in time. We argue that, to define these behavioral transitions, the missing ingredient is events. Events are changes in the state of the system that induce qualitative changes in the task. Based on this view, we introduce a framework that decouples agent identity from behavior, capturing a continuous manifold from which agents instantiate their behaviors in response to events. This framework is based on two elements. First, to build an expressive behavior manifold, we introduce Neural Manifold Diversity (NMD), a formal distance metric that remains well-defined when behaviors are transient and agent-agnostic. Second, we use an event-based hypernetwork that generates Low-Rank Adaptation (LoRA) modules over a shared team policy, enabling on-the-fly agent-policy reconfiguration in response to events. We prove that this construction ensures that diversity does not interfere with reward maximization by design. Empirical results demonstrate that our framework outperforms established baselines across benchmarks while exhibiting zero-shot generalization, and being the only method that solves tasks requiring sequential behavior reassignment.
☆ A Semi-Supervised Framework for Speech Confidence Detection using Whisper IEEE
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper baseline, yielding a 3\% improvement in the minority class, confirming that explicit prosodic and auxiliary features provide necessary corrective signals which are otherwise lost in deep semantic representations. Ablation studies further show that a curated set of high confidence pseudo-labels outperforms indiscriminate large scale augmentation, confirming that data quality outweighs quantity for perceived confidence detection.
comment: 12 pages, 9 Figures, Submitted to IEEE Transactions on Audio, Speech and Language Processing
☆ Scalable Token-Level Hallucination Detection in Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
☆ Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.
☆ Discrete Flow Matching for Offline-to-Online Reinforcement Learning
Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is itself challenging, as the policy must improve from new interaction without losing useful behavior learned from static data. To address those challenges, we introduce DRIFT, an online fine-tuning method that updates an offline pretrained continuous-time Markov chain (CTMC) policy with an advantage-weighted discrete flow matching loss. To preserve useful pretrained knowledge, we add a path-space penalty that regularizes the full CTMC trajectory distribution, rather than only the final action distribution. For large discrete action spaces, we introduce a candidate-set approximation that updates the actor over a small subset of actions sampled from reference-policy rollouts and uniform exploration. Our theoretical analysis shows that the candidate-set error is controlled by missing target probability mass, and the induced CTMC generator error decreases as the candidate set covers more high-probability actions. Experiments on prevailing discrete action RL task show that our method provides stable offline-to-online improvement across all tasks, achieving the highest average score on Jericho with a simple GRU encoder while outperforming methods that use pretrained language models. Controlled experiments further confirm that the path-space penalty remains bounded during fine-tuning and that the CTMC generator adapts to shifted rewards faster than deterministic baselines. The candidate-set mechanism is supported by a stability analysis showing that the generator error decreases exponentially with candidate coverage.
☆ Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts
Accurate crop yield forecasting in commercial soft fruit production is constrained by the data available in typical commercial farm records, which lack the sensor networks, satellite imagery, and high-resolution meteorological inputs that most state-of-the-art approaches assume. We propose a structured LLM agent framework that performs post-hoc correction of existing model predictions, encoding agricultural domain knowledge across tools for phase detection, bias learning, and range validation. Evaluated on a proprietary strawberry yield dataset and a public USDA corn harvest dataset, agent refinement of XGBoost reduced MAE by 20% and MASE by 56% on strawberry, with consistent improvements across Moirai2 (MAE 24%, MASE 22%) and Random Forest (MAE 28%, MASE 66%) baselines. Using Llama 3.1 8B as the agent produced the strongest corrections across all configurations; LLaVA 13B showed inconsistent gains, highlighting sensitivity to the choice of refinement model.
comment: 21 pages, 6 figures, 6 tables
☆ Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose \textbf{GAP}, a \textbf{G}ranular \textbf{A}lignment \textbf{P}aradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
☆ MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions
Solving partial differential equations (PDEs) with machine learning typically requires training a new neural network for every new equation. This optimization is slow. We introduce MetaColloc. It is an optimization-free and data-free framework that removes this bottleneck completely. We decouple basis discovery from the solving process. We meta-train a dual-branch neural network on diverse Gaussian Random Fields. This offline process creates a universal dictionary of neural basis functions. At test time, we freeze the network. We solve the PDE by assembling a collocation matrix. We find the solution through a single linear least squares step. For non-linear PDEs, we apply the Newton-Raphson method to achieve fast quadratic convergence. Our experiments across six 2D and 3D PDEs show massive improvements. MetaColloc reaches state-of-the-art accuracy on smooth and non-linear problems. It also reduces test-time computation by several orders of magnitude. Finally, we provide a detailed frequency sweep analysis. This analysis reveals a critical mismatch between function approximation and operator stability at extremely high frequencies. This profound finding opens a clear path toward future operator-aware meta-learning.
comment: 21 pages, 5 figures, 6 tables
☆ Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries
Agentic AI governance is a critical component of agentic AI infrastructure ensuring that agents follow their owner's communication and interaction policies, and providing protection against attacks from malicious agents. The state-of-the-art solution, SAGA, assumes a logically centralized point of trust, the Provider, which serves as a repository for user and agent information and actively enforces policies. While SAGA provides protection against malicious agents, it remains vulnerable to a malicious Provider that deviates from the protocol, undermining the security of the identity and access control infrastructure. Deployment on both private and public clouds, each susceptible to insider threats, further increases the risk of Provider compromise. In this work, we analyze the attacks that can be mounted from a compromised Provider, taking into account the different system components and realistic deployments. We identify and execute several concrete attacks with devastating effects: undermining agent attributability, extracting private data, or bypassing access control. We then present three types of solutions for securing the Provider that offer different trade-offs between security and performance. We first present SAGA-BFT, a fully byzantine-resilient architecture that provides the strongest protection, but incurs significant performance degradation, due to the high-cost of byzantine resilient protocols. We then propose SAGA-MON and SAGA-AUD, two novel solutions that leverage lightweight server-side monitoring or client-side auditing to provide protection against most classes of attacks with minimal overhead. Finally, we propose SAGA-HYB, a hybrid architecture that combines byzantine-resilience with monitoring and auditing to trade-off security for performance. We evaluate all the architectures and compare them with SAGA. We discuss which solution is best and under what conditions.
comment: 18 pages, 18 figures, 4 tables
☆ From Message-Passing to Linearized Graph Sequence Models
Message-passing based approaches form the default backbone of most learning architectures on graph-structured data. However, the rapid progress of modern deep learning architectures in other domains, particularly sequence modeling, raises the question of how graph learning can benefit from these advances. We introduce Linearized Graph Sequence Models, a framework that recasts message-passing graph computation from the perspective of sequence modeling to simplify architectural choices. Our approach systematically separates the computational processing depth from the information propagation depth, allowing core graph architectural decisions to be treated as sequence modeling choices. Specifically, we analyze, both empirically and theoretically, what sequence properties make methods effective for learning and preserving the graph inductive bias. In particular, we validate our findings, demonstrating improved performance on long-range information tasks in graphs. Our findings provide a principled way to integrate modern sequence modeling advances into message-passing based graph learning. Beyond this, our work demonstrates how the separation of processing and information depth can recast central architectural questions as input modeling choices.
☆ A New Technique for AI Explainability using Feature Association Map
Lack of transparency in AI systems poses challenges in critical real-life applications. It is important to be able to explain the decisions of an AI system to ensure trust on the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we are proposing a new algorithm for Explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed as Feature Association Map (FAM). The foundation of the modelling is based on association between features. The proposed FAMeX algorithm has been found to be better than the competing XAI algorithms - Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This definitely shows that FAMeX is a promising algorithm in explaining the predictions from an AI system
☆ Neural-Schwarz Tiling for Geometry-Universal PDE Solving at Scale
Most learned PDE solvers follow a global-surrogate paradigm: a neural operator is trained to map full problem descriptions to full solution fields for a prescribed distribution of geometries, boundary conditions, and coefficients. This has enabled fast inference within fixed problem families, but limits reuse across new domains and makes large-scale deployment dependent on expensive problem-specific data generation. We introduce $\textbf{NEST}$ ($\textbf{Ne}$ural-$\textbf{S}$chwarz $\textbf{T}$iling), a local-to-global framework that shifts learning from full-domain solution operators to reusable local physical solvers. The central premise is that, although global PDE solutions depend on geometry, scale, and boundary conditions, the physical response on small neighborhoods can be learned locally and composed into global solutions through classical domain decomposition. NEST learns a neural operator on minimal voxel patches ($3 \times 3 \times 3$) with diverse local geometries and boundary/interface data. At inference time, an unseen voxelized domain is tiled into overlapping patches, the learned local solver is applied patchwise, and global consistency is enforced through iterative Schwarz coupling with partition-of-unity assembly. In this way, generalization is shifted from a monolithic neural model to the combination of local physics learning and algorithmic global assembly. We instantiate NEST on nonlinear static equilibrium in compressible neo-Hookean solids and evaluate it on large, geometrically complex 3D domains far outside the scale of the training patches. Our results show that local neural building blocks, coupled through Schwarz iteration, offer a reusable local-training path toward scalable learned PDE solvers that generalize across domain size, shape, and boundary-condition configurations.
☆ Multi-Variable Conformal Prediction: Optimizing Prediction Sets without Data Splitting
Conformal prediction constructs prediction sets with finite-sample coverage guarantees, but its calibration stage is structurally constrained to a scalar score function and a single threshold variable - forcing shapes of prediction sets to be fixed before calibration, typically through data splitting. We introduce multi-variable conformal prediction (MCP), a framework that extends conformal prediction to vector-valued score functions with multiple simultaneous calibration variables. Building on scenario theory as a principled framework for certifying data-driven decisions, MCP unifies prediction set design and calibration into a single optimization problem, eliminating data splitting without sacrificing coverage guarantees. We propose two computationally efficient variants: RemMCP, grounded in constrained optimization with constraint removal, which admits a clean generalization of split conformal prediction; and RelMCP, based on iterative optimization with constraint relaxation, which supports non-convex score functions at the cost of possibly greater conservatism. Through numerical experiments on ellipsoidal and multi-modal prediction sets, we demonstrate that RemMCP and RelMCP consistently meet the target coverage with prediction set sizes smaller than or comparable to those of baselines with data split, while considerably reducing variance across calibration runs - a direct consequence of using all available data for shape optimization and calibration simultaneously.
☆ Online Learning-to-Defer with Varying Experts
Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ is the number of labels, and $n_e$ is the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.
☆ BSO: Safety Alignment Is Density Ratio Matching
Aligning language models for both helpfulness and safety typically requires complex pipelines-separate reward and cost models, online reinforcement learning, and primal-dual updates. Recent direct preference optimization approaches simplify training but incorporate safety through ad-hoc modifications such as multi-stage procedures or heuristic margin terms, lacking a principled derivation. We show that the likelihood ratio of the optimal safe policy admits a closed-form decomposition that reduces safety alignment to a density ratio matching problem. Minimizing Bregman divergences between the data and model ratios yields Bregman Safety Optimization (BSO), a family of single-stage loss functions, each induced by a convex generator, that provably recover the optimal safe policy. BSO is both general and simple: it requires no auxiliary models, introduces only one hyperparameter beyond standard preference optimization, and recovers existing safety-aware methods as special cases. Experiments across safety alignment benchmarks show that BSO consistently improves the safety-helpfulness trade-off.
☆ Manifold Sampling via Entropy Maximization
Sampling from constrained distributions has a wide range of applications, including in Bayesian optimization and robotics. Prior work establishes convergence and feasibility guarantees for constrained sampling, but assumes that the feasible set is connected. However, in practice, the feasible set often decomposes into multiple disconnected components, which makes efficient sampling under constraints challenging. In this paper, we propose MAnifold Sampling via Entropy Maximization (MASEM) for sampling on a manifold with an unknown number of disconnected components, implicitly defined by smooth equality and inequality constraints. The presented method uses a resampling scheme to maximize the entropy of the empirical distribution based on k-nearest neighbor density estimation. We show that, in the mean field, MASEM decreases the KL-divergence between the empirical distribution and the maximum-entropy target exponentially in the number of resampling steps. We instantiate MASEM with multiple local samplers and demonstrate its versatility and efficiency on synthetic and robotics-based benchmarks. MASEM enables fast and scalable mixing across a range of constrained sampling problems, improving over alternatives by an order of magnitude in Sinkhorn distance with competitive runtime.
☆ EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.
comment: Retrieval Augmented EHR Foundation Model
☆ Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation. Source code is available at https://github.com/IST-DASLab/GridGames.
comment: Preprint
☆ Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
We study the fundamental and timely problem of learning long sequences in autoregressive modeling and next-token prediction under model misspecification, measured by the joint Kullback--Leibler (KL) divergence. Our goal is to characterize how the sequence horizon \(H\) affects both approximation and estimation errors in this joint-distribution, sequence-level regime. By establishing matching upper and lower bounds, we provide, to our knowledge, the first complete characterization of long-horizon error behavior under the natural joint KL objective, with improved rates and optimality justification relative to existing work. On the approximation side, we show that joint KL admits a horizon-free approximation factor, in sharp contrast to Hellinger-based analyses that exhibit an \(Ω(H)\) dependence for computationally efficient methods; this isolates the choice of divergence as the source of approximation amplification. On the estimation side, we prove a fundamental information-theoretic lower bound of order \(Ω(H)\) that holds for both decomposable policy classes and fully shared policies, matching the \(\widetilde O(H)\) upper bounds achieved by computationally efficient algorithms. Our analysis clarifies the landscape of recent autoregressive learning results by aligning the log-loss training objective, the sequence-level evaluation metric, and the approximation metric {\color{black}through a sharp joint-KL oracle theory}. We further show that these joint-KL guarantees imply policy learning regret bounds at rates matching prior imitation learning literature.
☆ Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling
Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce the reusability of previously acquired task knowledge. To address this problem, this paper proposes a transferable delay-aware reinforcement learning method based on implicit causal graph modeling. The proposed method uses a field-node encoder to represent high-dimensional observations as latent states with node-level semantics, and employs a message-passing mechanism to characterize dynamic causal dependencies among nodes, thereby learning transferable structured representations and environment dynamics knowledge. On this basis, imagination-driven behavior learning and planning are incorporated to optimize policies in the latent space, enabling cross-task knowledge transfer and rapid adaptation. Experimental results show that the proposed method outperforms baseline methods on DMC continuous control tasks with random delays. Cross-task transfer experiments further demonstrate that the learned structured representations and dynamics knowledge can be effectively transferred to new tasks and significantly accelerate policy adaptation.
In-context learning to predict critical transitions in dynamical systems
Critical transitions - abrupt, often irreversible changes in system dynamics - arise across human and natural systems, often with catastrophic consequences. Real-world observations of such shifts remain scarce, preventing the development of reliable early warning systems. Conventional statistical and spectral indicators, such as increasing variance, tend to fail under realistic conditions of limited data and correlated noise, whereas existing deep learning classifiers do not extrapolate beyond their training data distribution. In this work, we introduce TipPFN, an in-context learning (ICL) framework that uses a prior-data fitted network to infer a system's proximity to a critical transition. Trained on our novel synthetic data generator, which is based on canonical bifurcation scenarios coupled to diverse, randomized stochastic dynamics, TipPFN flexibly capitalizes on contexts of various sizes, complexity and dimensionalities. We demonstrate robust, state-of-the-art early detection of critical transitions in previously unseen tipping regimes, sim-to-real examples, and real-world observations in both ICL and zero-shot settings.
comment: 14+38 pages, 5+23 figures
☆ KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
Catastrophic forgetting remains the central obstacle in continual learning (CL): parameters shared across tasks interfere with one another, and existing regularization methods such as EWC and SI apply uniform penalties without awareness of which input region a parameter serves. We propose KAN-CL, a continual learning framework that exploits the compact-support spline parameterization of Kolmogorov-Arnold Networks (KANs) to perform importance-weighted anchoring at per-knot granularity. Deployed as a classification head on a convolutional backbone with standard EWC regularization on the backbone (bbEWC) KAN-CL achieves forgetting reductions of 88% and 93% over a head-only KAN baseline on Split-CIFAR-10/5T and Split-CIFAR-100/10T respectively, while matching or exceeding the accuracy of all baselines on both benchmarks. We further provide a Neural Tangent Kernel (NTK) analysis showing that KAN's spline locality induces a structural rank deficit in the cross-task NTK, yielding a forgetting bound that holds even in the feature-learning regime. These results establish that combining an architecture with natural parameter locality (KAN head) with a complementary backbone regularizer (bbEWC) yields a compositional and principled approach to catastrophic forgetting.
☆ From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
High-quality labeled data is essential for training robust machine learning models, yet obtaining annotations at scale remains expensive. AI-assisted annotation has therefore become standard in large-scale labeling workflows. However, in tasks where model predictions carry two independent components, a class label and spatial boundaries, a model may classify an object with high confidence while mislocalizing it. Existing AI-assisted workflows offer annotators no signal about where spatial errors are most likely. Without such guidance, humans may systematically underinspect subtly misplaced boxes. We address this by studying the effect of visualizing spatial uncertainty via a purpose-built interface. In a controlled study with 120 participants, those receiving uncertainty cues achieve higher label quality while being faster overall. A box-level analysis confirms that the cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes. These findings establish localization uncertainty as a lever to improve human-in-the-loop annotation. Code is available at https://mos-ks.github.io/MUHA/.
☆ Approximation of Maximally Monotone Operators : A Graph Convergence Perspective
Operator learning has been highly successful for continuous mappings between infinite-dimensional spaces, such as PDE solution operators. However, many operators of interest-including differential operators-are discontinuous or set-valued, and lie outside classical approximation frameworks. We propose a paradigm shift by formulating approximation via graph convergence (Painlevé-Kuratowski convergence), which is well-suited for closed operators. We show that uniform and $L^p$ approximation are fundamentally inadequate in this setting. Focusing on maximally monotone operators, we prove that any such operator can be approximated in the sense of local graph convergence by continuous encoder-decoder architectures, and further construct structure-preserving approximations that retain maximal monotonicity via resolvent-based parameterizations.
☆ STRABLE: Benchmarking Tabular Machine Learning with Strings
Benchmarking tabular learning has revealed the benefit of dedicated architectures, pushing the state of the art. But real-world tables often contain string entries, beyond numbers, and these settings have been understudied due to a lack of a solid benchmarking suite. They lead to new research questions: Are dedicated learners needed, with end-to-end modeling of strings and numbers? Or does it suffice to encode strings as numbers, as with a categorical encoding? And if so, do the resulting tables resemble numerical tabular data, calling for the same learners? To enable these studies, we contribute STRABLE, a benchmarking corpus of 108 tables, all real-world learning problems with strings and numbers across diverse application fields. We run the first large-scale empirical study of tabular learning with strings, evaluating 445 pipelines. These pipelines span end-to-end architectures and modular pipelines, where strings are first encoded, then post-processed, and finally passed to a tabular learner. We find that, because most tables in the wild are categorical-dominant, advanced tabular learners paired with simple string embeddings achieve good predictions at low computational cost. On free-text-dominant tables, large LLM encoders become competitive. Their performance also appears sensitive to post-processing, with differences across LLM families. Finally, we show that STRABLE is a good set of tables to study "string tabular" learning as it leads to generalizable pipeline rankings that are close to the oracle rankings. We thus establish STRABLE as a foundation for research on tabular learning with strings, an important yet understudied area.
☆ Targeted Neuron Modulation via Contrastive Pair Search
Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and non-degeneracy across all steering strengths. Applying CNA to matched base and instruct models across Llama and Qwen architectures (from 1B to 72B parameters), we find that base models contain similar late-layer discrimination structures but steering these neurons produces only content shifts, not behavioral change. These results demonstrate that neuron-level intervention enables reliable behavioral steering without the quality tradeoffs of residual-stream methods. More broadly, our findings suggest that alignment fine-tuning transforms pre-existing discrimination structure into a sparse, targetable refusal gate.
☆ PriorZero: Bridging Language Priors and World Models for Decision Making
Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, as the prior is blind to environment-specific dynamics; while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions, its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.
comment: 30 pages, 12 figures
☆ What makes a word hard to learn? Modeling L1 influence on English vocabulary difficulty ACL
What makes a word difficult to learn, and how does the difficulty depend on the learner's native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word's familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
comment: Submitted to BEA 2026 at ACL. 18 pages, 13 figures
☆ Hypernetworks for Dynamic Feature Selection
Dynamic feature selection (DFS) is a machine learning framework in which features are acquired sequentially for individual samples under budget constraints. The exponential growth in the number of possible feature acquisition paths forces a DFS model to balance fitting specific scenarios against maintaining general performance, even when the feature space is moderate in size. In this paper, we study the structural limitations of existing DFS approaches to achieve an optimal solution. Then, we propose \textsc{Hyper-DFS}, a hypernetwork-based DFS approach that generates feature subset-specific classifier parameters on demand. We show that the use of hypernetworks compared to mask-embedding methods results in a smaller structural complexity bound. We also use a Set Transformer encoding to create a smooth conditioning space for the hypernetwork, so that functionally similar tasks are also geometrically close. In our benchmarks, \textsc{Hyper-DFS} outperforms all state-of-the-art approaches on synthetic and real-life tabular data. It is also competitive or superior across all image datasets tested, and shows substantially stronger zero-shot generalisation to feature subsets never seen during training than existing DFS approaches.
Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.
☆ Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs
We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action-observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are epsilon-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
☆ Delay-Empowered Causal Hierarchical Reinforcement Learning
Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.
☆ Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models ICML-2026
Multimodal large language models (MLLMs) have achieved remarkable progress, yet the object hallucination remains a critical challenge for reliable deployment. In this paper, we present an in-depth analysis of instruction token embeddings and reveal that they implicitly encode visual information while effectively filtering erroneous information introduced by misleading visual embeddings. Building on this insight, we propose the Instruction Lens Score (InsLen), which combines a Calibrated Local Score with a Context Consistency Score that measures context consistency of the object tokens. The proposed approach serves as a plug-and-play object hallucination detector without relying on auxiliary models or additional training. Extensive experiments across multiple benchmarks and diverse MLLM architectures demonstrate that InsLen consistently outperforms existing hallucination detection methods, highlighting its effectiveness and robustness. The code is available at https://github.com/Fraserlairh/Instruction-Lens-Score.
comment: Accepted by ICML-2026
☆ Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference
When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $θ$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $θ$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $θ= (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.
comment: 12 pages, 2 figures, 1 table. Extended English version of a paper accepted for presentation at JSAI 2026
☆ SOAR: Scale Optimization for Accurate Reconstruction in NVFP4 Quantization
NVFP4 has recently emerged as an efficient 4-bit microscaling format for large language models (LLMs), offering superior numerical fidelity with native hardware support. However, existing methods often yield suboptimal performance due to inflexible scale selection and the coupled treatment of quantization and dequantization scales. To address these issues, we propose Scale Optimization for Accurate Reconstruction (SOAR), a novel post-training quantization framework that improves the accuracy of NVFP4 quantization. At its core, SOAR features Closed-form Joint Scale Optimization (CJSO), which jointly optimizes global and block-wise scales via analytical solutions derived from reconstruction error minimization. Furthermore, it incorporates Decoupled Scale Search (DSS). DSS decouples the high-precision quantization scale from its constrained dequantization counterpart, and performs discrete search to mitigate precision loss from scale quantization. Extensive experiments across multiple LLMs show that our method consistently outperforms existing NVFP4 quantization baselines, achieving superior accuracy under the same memory footprint with no additional hardware overhead. The code and models will be available at https://github.com/steven-bao1/SOAR.
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.
comment: 59 pages, 16 figures, 59 Tables. Code available at https://anonymous.4open.science/r/ecg-pretraining-strategies-4DE3
☆ TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
☆ Optimal Policy Learning under Budget and Coverage Constraints
We study optimal policy learning under combined budget and minimum coverage constraints. We show that the problem admits a knapsack-type structure and that the optimal policy can be characterized by an affine threshold rule involving both budget and coverage shadow prices. We establish that the linear programming relaxation of the combinatorial solution has an O(1) integrality gap, implying asymptotic equivalence with the optimal discrete allocation. Building on this result, we analyze two implementable approaches: a Greedy-Lagrangian (GLC) and a rank-and-cut (RC) algorithm. We show that the GLC closely approximates the optimal solution and achieves near-optimal performance in finite samples. By contrast, RC is approximately optimal whenever the coverage constraint is slack or costs are homogeneous, while misallocation arises only when cost heterogeneity interacts with a binding coverage constraint. Monte Carlo evidence supports these findings.
☆ No More, No Less: Task Alignment in Terminal Agents
Terminal agents are increasingly capable of executing complex, long-horizon tasks autonomously from a single user prompt. To do so, they must interpret instructions encountered in the environment (e.g., README files, code comments, stack traces) and determine their relevance to the task. This creates a fundamental challenge: relevant cues must be followed to complete a task, whereas irrelevant or misleading ones must be ignored. Existing benchmarks do not capture this ability. An agent may appear capable by blindly following all instructions, or appear robust by ignoring them altogether. We introduce TAB (Task Alignment Benchmark), a suite of 89 terminal tasks derived from Terminal-Bench 2.1. Each task is intentionally underspecified, with missing information provided as a necessary cue embedded in a natural environmental artifact, alongside a plausible but irrelevant distractor. Solving these tasks requires selectively using the cue while ignoring the distractor. Applying TAB to ten frontier agents reveals a systematic gap between task capability and task alignment. The strongest Terminal-Bench agent achieves high task completion but low task alignment on TAB. Evaluating six prompt-injection defenses further shows that suppressing distractor execution also suppresses the cues required for task completion. These results demonstrate that task-aligned agents require selective use of environmental instructions rather than blanket acceptance or rejection.
☆ Intrinsic Vicarious Conditioning for Deep Reinforcement Learning
Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the environment as well as from others. Off-policy or learn-by-example methods can learn from demonstrators' representations, but they require access to the demonstrating agent's policies or their reward functions. Our work overcomes this direct sampling limitation by introducing vicarious conditioning as an intrinsic reward mechanism. We draw from psychological and biological literature to provide a foundation for vicarious conditioning and use memory-based methods to implement its four steps: attention, retention, reproduction, and reinforcement. Crucially, our vicarious conditioning paradigms support low-shot learning and do not require the demonstrator agent's policy nor its reward functions. We evaluate our approach in the MiniWorld Sidewalk environment, one of the few public environments that features a non-descriptive terminal condition (no reward provided upon agent death), and extend it to Box2D's CarRacing environment. Our results across both environments demonstrate that vicarious conditioning enables longer episode lengths by discouraging the agent from non-descriptive terminal conditions and guiding the agent toward desirable states. Overall, this work emulates a cognitively-plausible learning paradigm better suited to problems such as single-life learning or continual learning.
☆ TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion AAMAS 2026
Safe autonomous agents and mobile robots need fast real time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re bin and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real time robotic deployment. Our source code is publicly available on GitHub.
comment: Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.
comment: Accepted for publication in TMLR (https://openreview.net/forum?id=T8w8L2t3JG)
☆ Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
We study the \textit{parameter placement problem}: given a fixed budget of $k$ trainable entries within the B matrix of a LoRA adapter (A frozen), does the choice of which $k$ matter? Under supervised fine-tuning, random and informed subsets achieve comparable performance. Under GRPO on base models, random placement fails to improve over the base model, while gradient-informed placement recovers standard LoRA accuracy. This regime dependence traces to gradient structure: SFT gradients are low-rank and directionally stable, so any subset accumulates coherent updates; GRPO gradients are high-rank and near-orthogonal across steps, so only elements with consistently signed gradients retain the learning signal. Our scoring procedure identifies these critical parameters in under 10 seconds at less than 0.5% of training cost. Selected parameters concentrate on residual-stream-writing projections (V, O, Down), stable across model families and scales (1.5B - 8B).
comment: Preprint. Comments welcome
☆ On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observation and the optimal action are separated by many time steps (called the horizon), are particularly challenging: training suffers from poor generalization, severe sample inefficiency, and prohibitive exploration costs. Ideally, an agent trained on short horizons would retain optimal behavior at arbitrarily longer ones, but no formal framework currently characterizes when this is achievable. To fill this gap, we formalized temporal horizon generalization, the property that a policy remains optimal for all horizons, derived a necessary and sufficient condition for it, and experimentally evaluated the ability of nonlinear and parallelizable RNN variants to achieve it. This paper presents the resulting theoretical framework, the empirical evaluation, and the dynamical interpretation linking RNN behavior to temporal horizon generalization. Our analyses reveal that multistability is necessary for temporal horizon generalization and, in simple tasks, sufficient; more complex tasks further require transient dynamics. In contrast, modern parallelizable architectures, namely state space models and gated linear RNNs, are monostable by construction and consequently fail to generalize across temporal horizons. We conclude that multistability and transient dynamics are two essential and complementary dynamical regimes for horizon generalization, and that no current parallelizable RNN exhibits both. Designing parallelizable architectures that combine these regimes thus emerges as a key direction for scalable long-horizon RL.
comment: 23 pages, 6 figures
☆ Investigating simple target-covariate relationships for Chronos-2 and TabPFN-TS
Time Series Foundation Models (TSFMs) have recently achieved state-of-the-art performance, often outperforming supervised models in zero-shot settings. Recent TSFM architectures, such as Chronos-2 and TabPFN-TS, aim to integrate covariates. In this paper, we design controlled experiments based on simple target-covariate relationships to assess this integration capability. Our results show that TabPFN-TS captures these relationships more effectively than Chronos-2, especially for short horizons, suggesting that the strong benchmark performance of Chronos-2 does not automatically translate into optimal modeling of simple covariate-target dependencies.
☆ Overtrained, Not Misaligned
Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
comment: Under review at CoLM 2026; companion to Nature Matters Arising (also under review). 25 pages, 6 figures
☆ A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning
Leveraging Graph Neural Networks (GNNs) as graph encoders and aligning the resulting representations with Large Language Models (LLMs) through alignment instruction tuning has become a mainstream paradigm for constructing Graph Language Models (GLMs), combining the generalization ability of LLMs with the structural modeling capacity of GNNs. However, existing GLMs that adopt GNNs as graph encoders largely overlook the problem of aligning GNN-encoded representations across domains and tasks with the LLM token space to obtain unified graph tokens, thereby limiting their ability to generalize across diverse graph data. To bridge this gap, we aim to incorporate a multi-domain, multi-task GNN encoder into GLMs and align its representations with LLMs to enable multi-domain, multi-task graph alignment instruction tuning. This alignment problem remains underexplored and poses two key challenges: 1) learning GNN-encoded representations that are simultaneously generalizable across domains and tasks and well aligned with textual semantics is difficult, due to substantial variations in graph structures, feature distributions, and supervision signals, together with the lack of textual-semantic alignment guidance in task-specific GNN training; 2) diverse graph data and task-specific instructions can exhibit different degrees of compatibility with the LLM token space during instruction tuning, leading to varying alignment difficulty and rendering a fixed alignment strategy suboptimal. To tackle these challenges, we propose UniGraphLM, a Unified Graph Language Model that incorporates a multi-domain, multi-task GNN encoder to learn generalizable graph representations aligned with textual semantics, and then adaptively aligns these representations with the LLM.
☆ ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting
Accurate ultra-short-term wind power forecasting is critical for grid dispatch and reserve management, yet remains challenging due to the non-stationary, condition-dependent nature of wind generation. Meteorological exogenous variables carry substantial predictive information, but the most informative variable combination varies across sites, operating conditions, and prediction horizons. Existing deep learning approaches either treat exogenous inputs as generic auxiliary channels through uniform mixing or soft gating, or rely on fixed preprocessing steps such as PCA, without exploiting the physical structure of meteorological variables. We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework that decomposes exogenous variable modeling into two complementary modules. Physically-Grounded Variable Selection (PGVS) performs hierarchical, group-aware sparse selection over exogenous variables using a domain-informed physical prior and sparsemax activations, producing a compact, condition-adaptive exogenous context. Exogenous-Conditioned Regime Refinement (ECRR) routes the forecast through learned regime experts that apply gain--bias calibration and horizon-specific corrections via a mixture-of-experts paradigm. Experiments on three wind farms spanning different climates, capacities (66--200 MW), and exogenous dimensions (11--13 variables) demonstrate that ECTO achieves the lowest MSE across all sites, with relative improvements over the strongest baseline ranging from 2.2% to 5.2%, widening to 8.6% at the longer prediction horizon ($H=32$). Ablation analysis confirms that each exogenous-related component contributes positively (PGVS +1.84%, ECRR +2.86%), and interpretability analysis reveals that PGVS learns physically meaningful, site-specific variable selection patterns, while ECRR converges to well-separated calibration strategies consistent across sites.
comment: 42 pages, 10 figures, 9 tables
☆ Fair Conformal Classification via Learning Representation-Based Groups
Conformal prediction methods provide statistically rigorous marginal coverage guarantees for machine learning models, but such guarantees fail to account for algorithmic biases, thereby undermining fairness and trust. This paper introduces a fair conformal inference framework for classification tasks. The proposed method constructs prediction sets that guarantee conditional coverage on adaptively identified subgroups, which can be implicitly defined through nonlinear feature combinations. By balancing effectiveness and efficiency in producing compact, informative prediction sets and ensuring adaptive equalized coverage across unfairly treated subgroups, our approach paves a practical pathway toward trustworthy machine learning. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the framework.
☆ Probing Non-Equilibrium Grain Boundary Dynamics with XPCS and Domain-Adaptive Machine Learning
Grain-boundary (GB) dynamics control the stability, mechanical, and functional response of nanocrystalline materials, but direct experimental access to their slow non-equilibrium motion has been limited. Here we establish X-ray photon correlation spectroscopy (XPCS), combined with domain-adaptive machine learning, as a quantitative probe of GB dynamics. Temperature- and grain-size-dependent two-time XPCS measurements in nanocrystalline silicon reveal pronounced departures from time-translation invariance, showing that GB relaxation can remain far from equilibrium over experimental timescales. However, direct extraction of quantitative physical information from these high-dimensional, noisy fluctuation maps faces a significant challenge. To overcome this barrier, we develop a semi-supervised learning framework that transfers physical parameter labels from continuum simulations to unlabeled experimental XPCS maps through domain-adaptive representation alignment. This AI-augmented approach enables the extraction of key kinetic parameters, including bulk diffusivity, GB stiffness, and effective GB concentration, directly from experimental XPCS measurements. Our results show how machine learning can transform indirect fluctuation signals into quantitative materials dynamics, providing a general route to study non-equilibrium defect motion in solids.
comment: 14 pages, 4 figures
☆ Information-Theoretic Generalization Bounds for Sequential Decision Making
Information-theoretic generalization bounds based on the supersample construction are a central tool for algorithm-dependent generalization analysis in the batch i.i.d.~setting. However, existing supersample conditional mutual information (CMI) bounds do not directly apply to sequential decision-making problems such as online learning, streaming active learning, and bandits, where data are revealed adaptively and the learner evolves along a causal trajectory. To address this limitation, we develop a sequential supersample framework that separates the learner filtration from a proof-side enlargement used for ghost-coordinate comparisons. Under a row-wise exchangeability assumption, the sequential generalization gap is controlled by sequential CMI, a sum of roundwise selector--loss information terms. We also establish a Bernstein-type refinement that yields faster rates under suitable variance conditions. The selector-SCMI proof strategy applies to online learning, streaming active learning with importance weighting, and stochastic multi-armed bandits.
☆ DriftXpress: Faster Drifting Models via Projected RKHS Fields
Drifting Models have emerged as a new paradigm for one-step generative modeling, achieving strong image quality without iterative inference. The premise is to replace the iterative denoising process in diffusion models with a single evaluation of a generator. However, this creates a different trade-off: drifting reduces inference cost by moving much of the computation into training. We introduce DriftXpress, an accelerated formulation of drifting models based on projected RKHS fields. DriftXpress approximates the drifting kernel in a low-rank feature space. This preserves the attraction-repulsion structure of the original drifting field while reducing the cost of field evaluation. Across image-generation benchmarks, DriftXpress achieves comparable FID to standard drifting while reducing wall-clock training cost. These results show that the training-inference trade-off of drifting models can be pushed further without giving up their one-step inference advantage.
☆ Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addressed: when the rules can be read at inference time, does an agent still need to learn them? We argue, and demonstrate empirically, that in settings where transition dynamics are configurable and readable, runtime discovery complements offline training by grounding predictions in the active system instance. We propose enterprise discovery agents, which recover relevant transition dynamics at runtime by reading the system's configuration rather than relying solely on internalized representations. We introduce CascadeBench, a reasoning-focused benchmark for enterprise cascade prediction that adopts the evaluation methodology of World of Workflows on diverse synthetic environments, and use it together with deployment-shift evaluation to show that offline-trained world models can perform well in-distribution but degrade as dynamics change, whereas discovery-based agents are more robust under shift by grounding their predictions in the current instance. Our findings suggest that, in configurable enterprise environments, agents should not rely solely on fixed internalized dynamics, but should incorporate mechanisms for discovering relevant transition logic at runtime.
☆ Multi-Task Representation Learning for Conservative Linear Bandits
This paper presents the Constrained Multi-Task Representation Learning (CMTRL) framework for linear bandits. We consider T linear bandit tasks in a d dimensional space, which share a common low-dimensional representation of dimension r, where r is much smaller than the minimum of d and T. Furthermore, tasks are constrained so that only actions meeting specific safety or performance requirements are allowed, referred to as conservative (safe) bandits. We introduce a novel algorithm, Safe-Alternating projected Gradient Descent and minimization (Safe-AltGDmin), to recover a low-rank feature matrix while satisfying the given constraints. Building on this algorithm, we propose a multi-task representation learning framework for conservative linear bandits and establish theoretical guarantees for its regret and sample complexity bounds. We presented experiments and compared the performance of our algorithm with benchmark algorithms.
☆ Expected Batch Optimal Transport Plans and Consequences for Flow Matching
Solving optimal transport (OT) on random minibatches is a common surrogate for exact OT in large-scale learning. In flow matching (FM), this surrogate is used to obtain OT-like couplings that can straighten probability paths and reduce numerical integration cost. Yet, the population-level coupling induced by repeated minibatch OT remains only partially understood. We formalize this coupling as the expected batch OT plan $\overlineπ_{k}$, obtained by averaging empirical OT plans over independent minibatches of size $k$. We then establish its large-batch consistency and, in the semidiscrete case relevant to generative modeling, derive rates for both the transport-cost bias and the convergence of $\overlineπ_{k}$ to the OT plan. For FM, this yields a population coupling whose induced velocity field is regular enough to define a unique flow from the source to the discrete target. We finally quantify how OT batch size interacts with numerical integration in a tractable two-atom model and in synthetic and image experiments.
☆ Lower bounds for one-layer transformers that compute parity
This note shows that no self-attention layer post-processed by a rational function can sign-represent the parity function unless the product of the number of heads and the degree of the post-processing function grows linearly with the input length. Combining this lower bound with rational approximation of ReLU networks yields a margin-dependent extension for self-attention layers post-processed by ReLU networks.
☆ On What We Can Learn from Low-Resolution Data
Artificial intelligence systems typically rely on large, centrally collected datasets, a premise that does not hold in many real-world domains such as healthcare and public institutions. In these settings, data sharing is often constrained by storage, privacy, or resource limitations. For example, small wearable devices may lack the bandwidth or energy capacity needed to store and transmit high-resolution data, leading to aggregation during data collection and thus a loss of information. As a result, datasets collected from different sources may consist of a mixture of high- and low-resolution samples. Despite the prevalence of this setting, it remains unclear how informative low-resolution data is when models are ultimately evaluated on high-resolution inputs. We provide a theoretical analysis based on the Kullback-Leibler divergence that characterises how the influence of a datapoint changes with resolution, and derive bounds that relate the relative contribution of high- and low-resolution observations to the information lost under downsampling. To support this analysis, we empirically demonstrate, using both a vision transformer and a convolutional neural network, that adding low-resolution data to the training set consistently improves performance when high-resolution data is scarce.
☆ Machine Learning for neutron source distributions
In light of the recent advancements in machine learning, we propose a novel approach to neutron source distribution estimation through the utilisation of probabilistic generative models. The estimation is based on a Monte Carlo particle list, which is only required during the training stage of the machine learning model. Once the source distribution has been learned, the model is independent of the original particle list, allowing for further sampling in an efficient, rapid, and memory-costless manner. The performance of various generative models is evaluated, including a variational autoencoder, a normalizing flow, a generative adversarial network, and a denoising diffusion model. These approaches are then compared to existing source distribution estimations, and the advantages and disadvantages of each approach are discussed. The results demonstrate that source distributions can be modeled through the use of probabilistic generative models, which paves the way for further advancements in this field.
comment: Under review at Machine Learning: Science & Technology
☆ Fused Gromov-Wasserstein Distance with Feature Selection
Fused Gromov-Wasserstein (FGW) distances provide a principled framework for comparing objects by jointly aligning structure and node features. However, existing FGW formulations treat all features uniformly, which limits interpretability and robustness in high-dimensional settings where many features may be irrelevant or noisy. We introduce FGW distances with feature selection, which incorporate adaptive feature suppression weights into the FGW objective to selectively downweight or suppress differentiating features during alignment. We propose two approaches: (1) regularized FGW with Lasso and Ridge penalties, and (2) FGW with simplex-constrained weights, including groupwise extensions. We analyze the resulting models and establish their key theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances, and metric behavior. An efficient alternating minimization algorithm is developed. Experiments illustrate how feature suppression enhances interpretability and reveals task-relevant structure, with a special application to computational redistricting.
☆ PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior
Large language models (LLMs) are increasingly used to simulate human behavior, but their ability to simulate $individual$ privacy decisions is not well understood. In this paper, we address the problem of evaluating whether a core set of user persona attributes can drive LLMs to simulate individual-level privacy behavior. We introduce PrivacySIM, an evaluation suite that benchmarks LLM simulation of user privacy behavior against the ground-truth responses of 1,000 users. These users are drawn from five published user studies on privacy spanning LLM healthcare consultations, conversational agents, and chatbots. Drawing on these user studies, we hypothesize three persona facets as plausible predictors of privacy decision-making: demographics, previous experiences, and stated privacy attitudes. We condition nine frontier LLMs on subsets of these three facets and measure how often each model's response to a data-sharing scenario matches the user's actual response. Our findings show that (1) privacy persona conditioning consistently improves simulation quality over no-persona conditioning, but even the strongest model (40.4\% accuracy) remains far from faithfully simulating individual privacy decisions. (2) A user's stated privacy attitudes alone may not be the best predictor because they often diverge from the user's actual privacy behavior. (3) Users with high AI/chatbot experience but low stated privacy attitudes are the most challenging to simulate. PrivacySIM is a first step toward understanding and improving the capabilities of LLMs to simulate user privacy decisions. We release PrivacySIM to enable further evaluation of LLM privacy simulation.
☆ STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
comment: 9 pages, 4 figures, 3 tables. Code and models: https://github.com//autocharter
☆ MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation ICPR 2026
Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. There have been a number of recent works dealing with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in the regime. We, therefore, propose the new method Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.
comment: Accepted at ICPR 2026
☆ Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.
comment: 40 pages, 23 figures
☆ Keeping Score: Efficiency Improvements in Neural Likelihood Surrogate Training via Score-Augmented Loss Functions
For stochastic process models, parameter inference is often severely bottlenecked by computationally expensive likelihood functions. Simulation-based inference (SBI) bypasses this restriction by constructing amortized surrogate likelihoods, but most SBI methods assume a black-box data generating process. While these surrogates are exact in the limit of infinite training data, practical scenarios force a strict tradeoff between model quality and simulation cost. In this work, we loosen the black-box assumption of SBI to improve this tradeoff for structured stochastic process models. Specifically, for neural network likelihood surrogates trained via probabilistic classification, we propose to augment the standard binary cross-entropy loss with exact score information $\nabla_θ\log p(x \mid θ)$ and adaptive weighting based on loss gradients. We evaluate our approach on case studies involving network dynamics and spatial processes, demonstrating that our method improves surrogate quality at a drastically lower computational cost than generating more training data. Notably, in some cases, our approach achieves downstream inference performance equivalent to a 10x increase in training data with less than a 1.1x increase in training time.
comment: 9 pages of main text, 9 pages of appendices, 13 figures
☆ Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration
Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (Q{\footnotesize OED}), an adaptive information objective grounded in optimal experimental design. Q{\footnotesize OED} (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, Q{\footnotesize OED} provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate Q{\footnotesize OED} on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of \SI{35.23}{\percent} and \SI{21.98}{\percent}, respectively. When integrated as an exploration objective in model-based policy optimization, Q{\footnotesize OED} further improves policy performance over established RL baselines.
☆ Elicitation-Augmented Bayesian Optimization
Human-in-the-loop Bayesian optimization (HITL BO) methods utilize human expertise to improve the sample-efficiency of BO. Most HITL BO methods assume that a domain expert can quantify their knowledge, for instance by pinpointing query locations or specifying their prior beliefs about the location of the maximum as a probability distribution. However, since human expertise is often tacit and cannot be explicitly quantified, we consider a setting where domain knowledge of an expert is elicited via pairwise comparisons of designs. We interpret the expert's pairwise judgements as noisy evidence about the values of the observable objective function and develop a principled method for combining the information obtained via direct observations and pairwise queries. Specifically, we derive a cost-aware value-of-information acquisition function that balances direct observations against pairwise queries. The proposed method approaches the convex hull of the trajectories of the individual information sources: when pairwise queries are cheap it substantially improves sample-efficiency over observation-only BO, and when pairwise queries are costly or noisy, it recovers the performance of standard BO by relying on direct observations alone.
☆ Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emph{training--inference discrepancy term} that aligns inference-side and training-side distributions at the same behavior-policy version, and a \emph{policy-staleness term} that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
☆ Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection ICIP 2026
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
comment: Accepted to ICIP 2026
☆ Hölder Policy Optimisation
Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose \textbf{HölderPO}, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of $54.9\%$ across multiple mathematical benchmarks, yielding a substantial $7.2\%$ relative gain over standard GRPO and secures an exceptional $93.8\%$ success rate on ALFWorld.
☆ Learning plug-in surrogate endpoints for randomized experiments
Surrogate endpoints are used in place of long-term outcomes in randomized experiments when observing the real outcome for a large enough cohort is prohibitively expensive or impractical. A short-term surrogate is good if the result of an experiment using the surrogate is predictive of the result of a hypothetical study using the real outcome. Much attention has been paid to formalizing this property in causal terms, but most criteria are unidentifiable and cannot be turned into practical algorithms for learning surrogate endpoints from data. To address this, we study plug-in composite surrogates, functions of post-treatment variables that may be substituted directly for the primary outcome in a randomized experiment. We propose two methods for learning plug-in surrogates that maximize effect predictiveness, and characterize the possibility of finding endpoints that yield unbiased effect estimates in representative scenarios. Finally, in both synthetic experiments with known effects and in data from a real-world experiment, we find that our method, based on directly modeling the surrogate effect, returns plug-in endpoints more predictive of the primary effect than established methods.
comment: 29 pages, 5 figures
☆ Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons
Cortical neurons are complex, multi-timescale processors wired into recurrent circuits, shaped by long evolutionary pressure under stringent biological constraints. Mainstream machine learning, by contrast, predominantly builds models from extremely simple units, a default inherited from early neural-network theory. We treat this as a normative architectural question. How should one split a fixed parameter budget $P$ between the number of units $N$, per-unit effective complexity $k_e$, and per-unit connectivity $k_c$? What controls the optimal allocation? This calls for a model in which per-unit complexity can be tuned independently of width and connectivity. Accordingly, we introduce the ELM Network, whose recurrent layer is built from Expressive Leaky Memory (ELM) neurons, chosen to mirror functional components of cortical neurons. The architecture allows for individually adjusting $N$, $k_e$, and $k_c$ and trains stably across orders of magnitude in scale. We evaluate the model on two qualitatively different sequence benchmarks: the neuromorphic SHD-Adding task and Enwik8 character-level language modeling. Performance improves monotonically along each of the three axes individually. Under a fixed budget, a clear non-trivial optimum emerges in their tradeoff, and larger budgets favor both more and more complex neurons. A closed-form information-theoretic model captures these tradeoffs and attributes the diminishing returns at two ends to: per-neuron signal-to-noise saturation and across-neuron redundancy. A hyperparameter sweep spanning three orders of magnitude in trainable parameters traces a near-Pareto-frontier scaling law consistent with the framework. This suggests that the simple-unit default in ML is not obviously optimal once this tradeoff surface is probed, and offers a normative lens on cortex's reliance on complex spatio-temporal integrators.
comment: 25 pages, 21 figures, 3 tables, including derivations. Submitted for peer review
☆ Rethink the Role of Neural Decoders in Quantum Error Correction ICML 2026
Quantum error correction (QEC) is essential for enabling quantum advantages, with decoding as a central algorithmic primitive. Owing to its importance and intrinsic difficulty, substantial effort has been made to QEC decoder design, among which neural decoders have recently emerged as a promising data-driven paradigm. Despite this progress, practical deployment remains hindered by a fundamental accuracy-latency tradeoff, often on the microsecond timescale. To address this challenge, here we revisit neural decoders for surface-code decoding under explicit accuracy-latency constraints, considering code distances up to d=9 (161 physical qubits). We unify and redesign representative neural decoders into five architectural paradigms and develop an end-to-end compression pipeline to evaluate their deployability and performance on FPGA hardware. Through systematic experiments, we reveal several previously underexplored insights: (i) near-term decoding performance is driven more by data scale than architectural complexity; (ii) appropriate inductive bias is essential for achieving high decoding accuracy; and (iii) INT4 quantization is a prerequisite for meeting microsecond-scale latency requirements on FPGAs. Together, these findings provide concrete guidance toward scalable and real-time neural QEC decoding.
comment: Accepted to ICML 2026; 33 Pages, 9 figures
☆ Resilient Vision-Tabular Multimodal Learning under Modality Missingness
Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision, a tabular, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities from information aggregation and gradient propagation. To further enhance resilience, we introduce a modality-dropout regularization strategy that stochastically removes available modalities during training, encouraging the model to exploit complementary information under partial data availability. We evaluate our approach on the MIMIC-CXR dataset paired with structured clinical data from MIMIC-IV for multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel systematic stress-test protocols progressively increase training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all missingness regimes, the proposed method consistently outperforms representative baselines, showing smoother performance degradation and improved robustness. Ablation studies further demonstrate that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
☆ Approximation Theory of Laplacian-Based Neural Operators for Reaction-Diffusion System
Neural operators provide a framework for learning solution operators of partial differential equations (PDEs), enabling efficient surrogate modeling for complex systems. While universal approximation results are now well understood, approximation analysis specific to nonlinear reaction-diffusion systems remains limited. In this paper, we study neural operators applied to the solution mapping from initial conditions to time-dependent solutions of a generalized Gierer-Meinhardt reaction-diffusion system, a prototypical model of nonlinear pattern formation. Our main results establish explicit approximation error bounds in terms of network depth, width, and spectral rank by exploiting the Laplacian spectral representation of the Green's function underlying the PDE. We show that the required parameter complexity grows at most polynomially with respect to the target accuracy, demonstrating that Laplacian eigenfunction-based neural operator architectures alleviate the curse of parametric complexity encountered in generic operator learning. Numerical experiments on the Gierer-Meinhardt system support the theoretical findings.
☆ Efficient and Adaptive Human Activity Recognition via LLM Backbones
Human Activity Recognition (HAR) is a core task in pervasive computing systems, where models must operate under strict computational constraints while remaining robust to heterogeneous and evolving deployment conditions. Recent advances based on Transformer architectures have significantly improved recognition performance, but typically rely on task-specific models trained from scratch, resulting in high training cost, large data requirements, and limited adaptability to domain shifts. In this paper, we propose a paradigm shift that reuses large pretrained language models (LLMs) as generic temporal backbones for sensor-based HAR, instead of designing domain-specific Transformers. To bridge the modality gap between inertial time series and language models, we introduce a structured convolutional projection that maps multivariate accelerometer and gyroscope signals into the latent space of the LLM. The pretrained backbone is kept frozen and adapted using parameter-efficient Low-Rank Adaptation (LoRA), drastically reducing the number of trainable parameters and the overall training cost. Through extensive experiments on standard HAR benchmarks, we show that this approach enables rapid convergence, strong data efficiency, and robust cross-dataset transfer, particularly in low-data and few-shot settings. At the same time, our results highlight the complementary roles of convolutional frontends and LLMs, where local invariances are handled at the signal level while long-range temporal dependencies are captured by the pretrained backbone. Overall, this work demonstrates that LLMs can serve as a practical, frugal, and scalable foundation for adaptive HAR systems, opening new directions for reusing foundation models beyond their original language domain.
☆ SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, task-relevant skill materials or local artifacts can steer an agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that localized non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.
☆ Limits of Learning Linear Dynamics from Experiments
Learning governing dynamics from data is a common goal across the sciences, yet it is only well-posed when the underlying mechanisms are identifiable. In practice, many data-driven methods implicitly assume identifiability; when this assumption fails, estimated models can yield spurious predictions and invalid mechanistic conclusions. Classical identifiability guarantees for controlled linear time-invariant (LTI) systems provide sufficient conditions -- controllability and persistent excitation -- but leave open whether identifiability holds when these conditions fail, and which parts of the system remain identifiable without full identifiability. We show that the experimental setup, i.e., the realized initial state and control input, dictates a fundamental limit on the information recoverable from the observed trajectory. We develop a geometric characterization of this limit and derive a closed-form description of all systems consistent with the experimental setup. Crucially, we prove that even when the full system is not identifiable, the restricted dynamics on the subspace reachable by the experiment remain uniquely determined.
☆ Estimating Subgraph Importance with Structural Prior Domain Knowledge
We propose a subgraph importance estimation method for pretrained Graph Neural Networks (GNNs) on graph-level tasks, formulated as a linear Group Lasso regression problem in the embedding space. Our method effectively leverages prior domain knowledge of graph substructures, while remaining independent of the specific form of the output layer or readout function used in the GNN architecture, and it does not require access to ground-truth target labels. Experiments on real-world graph datasets demonstrate that our method consistently outperforms existing baselines in subgraph importance estimation. Furthermore, we extend our method to identify important nodes within the graph.
☆ Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation
This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.
☆ The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms -- GQA, MLA, Gated DeltaNet, and Mamba2 -- on NVIDIA H200, decode draws only 137--300\,W on a 700\,W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32\% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
☆ Random-Set Graph Neural Networks
Uncertainty quantification has become an important factor in understanding the data representations produced by Graph Neural Networks (GNNs). Despite their predictive capabilities being ever useful across industrial workspaces, the inherent uncertainty induced by the nature of the data is a huge mitigating factor to GNN performance. While aleatoric uncertainty is the result of noisy and incomplete stochastic data such as missing edges or over-smoothing, epistemic uncertainty arises from lack of knowledge about a system or model (e.g., a graph's topology or node feature representation), which can be reduced by gathering more data and information. In this paper, we propose an original new framework in which node-level epistemic uncertainty is modelled in a belief function (finite random set) formalism. The resulting Random-Set Graph Neural Networks have a belief-function head predicting a random set over the list of classes, from which both a precise probability prediction and a measure of epistemic uncertainty can be obtained. Extensive experiments on 9 different graph learning datasets, including real-world autonomous driving benchmarks as such Nuscene and ROAD, demonstrate RS-GNN's superior uncertainty quantification capabilities
comment: 23 pages, 6 figures
☆ QDSB: Quantized Diffusion Schrödinger Bridges
Learning generative models in settings where the source and target distributions are only specified through unpaired samples is gaining in importance. Here, one frequently-used model are Schrödinger bridges (SB), which represent the most likely evolution between both endpoint distributions. To accelerate training, simulation-free SBs avoid the path simulation of the original SB models. However, learning simulation-free SBs requires paired data; a coupling of the source and target samples is obtained as the solution of the entropic optimal transport (OT) problem. As obtaining the optimal global coupling is infeasible in many practical cases, the entropic OT problem is iteratively solved on minibatches instead. Still, the repeated cost remains substantial and the locality can distort the global transport geometry. We propose quantized diffusion Schrödinger bridges (QDSB), which compute the endpoint coupling on anchor-quantized endpoint distributions and lift the resulting plan back to original data points through cell-wise sampling. We show that the regularized optimal coupling is stable w.r.t. anchor quantization, with an error controlled by the quality of the anchor approximation. In real-world experiments, QDSB matches the sample quality of existing baselines, requiring substantially less time. Code and data are available at github.com/mathefuchs/qdsb.
☆ Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning ICML 2026
We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.
comment: Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)
☆ Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
Large Language Models (LLMs) suffer from order bias, where their performance is affected by the arrangement order of input elements. This unfairness limits the model's applications in scenarios such as in-context learning and Retrieval-Augmented Generation (RAG). Recent studies attempt to obtain optimal or suboptimal arrangements based on statistical results or using dataset-based search, but these methods increase inference overhead while leaving the model's inherent order bias unresolved. Other studies mitigate order sensitivity through supervised fine-tuning using augmented training sets with multiple order variants, but often at the cost of accuracy, trapping the model in consistent yet incorrect hallucinations. In this paper, we propose \textbf{D}ual \textbf{G}roup \textbf{A}dvantage \textbf{O}ptimization (\textbf{DGAO}), which aims to improve model accuracy and order stability simultaneously. DGAO calculates and balances intra-group relative accuracy advantage and inter-group relative stability advantage, rewarding the policy model for generating order-stable and correct outputs while penalizing order-sensitive or incorrect responses. This marks the first time reinforcement learning has been used to mitigate LLMs' order sensitivity. We also propose two new metrics, Consistency Rate and Overconfidence Rate, to reveal the pseudo-stability of previous methods and guide more comprehensive evaluation. Extensive experiments demonstrate that DGAO achieves superior order fairness while improving performance on RAG, mathematical reasoning, and classification tasks. Our code is available at: https://github.com/Hyalinesky/DGAO.
☆ NOFE -- Neural Operator Function Embedding
Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent to many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of input discretization. We establish NOFE as approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE across different datasets, comparing it against PCA, t-SNE, and UMAP. Our results demonstrate that NOFE significantly outperforms baselines in local structure preservation, achieving a local Stress of 0.111 compared to 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP for the ERA5 climate reanalysis dataset. NOFE also exhibits robust sampling independence, reducing the Patch Stitching Error by up to $20.0\times$ relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and produces smooth, consistent embeddings that generalize across varying sample densities, addressing key limitations of discrete reduction methods.
comment: 21 pages, 11 figures, 12 tables
☆ Assessment of cloud and associated radiation fields from a GAN stochastic cloud subcolumn generator
Modern Earth System Models (ESMs) operate on horizontal scales far larger than typical cloud features, requiring stochastic subcolumn generators to represent subgrid horizontal and vertical cloud variability. Traditional physically-based generators often rely on analytical cloud overlap paradigms, such as exponential-random decorrelation, which can struggle to capture the complex, anti-correlated behavior of non-contiguous cloud layers. In this study, we introduce a novel two-stage machine learning subcolumn generator for the GEOS atmospheric model, utilizing a Conditional Variational Autoencoder combined with a Generative Adversarial Network (CVAE-GAN) and a U-Net architecture. Trained on a merged CloudSat-CALIPSO height-resolved cloud optical depth dataset, the ML generator creates 56 stochastic subcolumns representing cloud occurrence and optical depth profiles. Evaluated against the established Räisänen, the ML approach accurately reproduces bimodal cloud overlap distributions, significantly reduces biases in grid-mean statistics, and halves the root-mean-square error in ISCCP-style cloud-top pressure and optical thickness joint histograms. The improvements brought by our deep generative models translate into more accurate offline radiative transfer calculations, reducing the global-mean shortwave top-of-atmosphere cloud radiative effect bias by a factor of three. Provided that the generator can be accelerated on CPUs, this offers a practical pathway to reduce structural errors at the cloud-radiation interface.
☆ STAGE: Tackling Semantic Drift in Multimodal Federated Graph Learning
Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. As graph data increasingly contain multimodal node attributes such as text and images, multimodal federated graph learning (MM-FGL) has become an important yet substantially harder setting. The key challenge is that clients from different modality domains may not share a common semantic space: even for the same concept, their local encoders can produce inconsistent representations before collaboration begins. This makes direct parameter coordination unreliable and further causes two downstream problems: forcing heterogeneous client representations into a naively shared semantic space may create false semantic agreement, and graph message passing may amplify residual inconsistency across neighborhoods. To address this issue, we propose \textbf{STAGE}, a protocol-first framework for MM-FGL. Instead of relying on direct parameter averaging, STAGE builds a shared semantic space that first translates heterogeneous multimodal features into comparable representations and then regulates how these representations propagate over local graph structures. In this way, STAGE not only improves cross-client semantic calibration, but also reduces the risk of inconsistency amplification during graph learning. Extensive experiments on 8 multimodal-attributed graphs across 5 graph-centric and modality-centric tasks show that STAGE consistently achieves state-of-the-art performance while reducing per-round communication payload.
☆ Understanding Sample Efficiency in Predictive Coding
Predictive Coding (PC) is an influential account of cortical learning. Much of recent work has focused on comparing PC to Backpropagation (BP) to find whether PC offers any advantages. Small scale experiments show that PC enables learning that is more sample efficient and effective in many contexts, though a thorough theoretical understanding of the phenomena remains elusive. To address this, we quantify the efficiency of learning in BP and PC through a metric called ``target alignment'', which measures how closely the change in the output of the network is aligned to the output prediction error. We then derive and empirically validate analytical expressions for target alignment in Deep Linear Networks. We show that learning in PC is more efficient than BP, which is especially pronounced in deep, narrow and pre-trained networks. We also derive exact conditions for guaranteed optimal target alignment in PC and validate our findings through experiments. We study full training trajectories of linear and non-linear models, and find the predicted benefits of PC persist in practice even when some assumptions are violated. Overall, this work provides a mechanistic understanding of the higher learning efficiency observed for PC over BP in previous works, and can guide how PC should be parametrised to learn most effectively.
☆ Delightful Gradients Accelerate Corner Escape
Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study \emph{Delightful Policy Gradient} (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an \emph{ally}: its contribution to escape is non-negative. Combining corner instability with a monotonic value improvement identity, we prove that DG converges globally to the optimal policy in both bandits and tabular MDPs at an asymptotic $O(1/t)$ rate. We also show, via an exact counterexample, that this tabular mechanism can fail under shared function approximation. In MNIST contextual bandits with a shared-parameter neural network, DG nevertheless recovers from bad initializations faster than standard policy gradient, suggesting that the counterexample marks a boundary of the theory rather than a practical prohibition.
comment: Preprint
☆ Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models
We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. \textbf{Main finding.} Under matched-path LLM-only scoring, the SFT-attributable procedural-$Δ$ lift is roughly uniform across sizes: $+0.070$ / $+0.040$ / $+0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $Δ$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. \textbf{Methodology.} (i) A bench format-compliance artifact: 83.5\% of the holdout uses a deterministic \texttt{ANSWER}-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against \CU. (ii) A negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT \CU{} pass-rate within a 2\,pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. \textbf{Cross-family validation.} GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $κ\geq 0.754$, agreement $\geq 93.25\%$. Earlier ``format-only at 0.8B'' and ``shrinking SFT at 4B'' framings were path-mismatch artifacts; this paper supersedes both (Appendix~\ref{sec:appendix-path}). Single-seed; threats in §\ref{sec:threats}.
☆ Incentivizing Truthfulness and Collaborative Fairness in Bayesian Learning ICML-26
Collaborative machine learning involves training high-quality models using datasets from a number of sources. To incentivize sources to share data, existing data valuation methods fairly reward each source based on its data submitted as is. However, as these methods do not verify nor incentivize data truthfulness, the sources can manipulate their data (e.g., by submitting duplicated or noisy data) to artificially increase their valuations and rewards or prevent others from benefiting. This paper presents the first mechanism that provably ensures (F) collaborative fairness and incentivizes (T) truthfulness at equilibrium for Bayesian models. Our mechanism combines semivalues (e.g., Shapley value), which ensure fairness, and a truthful data valuation function (DVF) based on a validation set that is unknown to the sources. As semivalues are influenced by others' data, we introduce an additional condition to prove that a source can maximize its expected data values in coalitions and semivalues by submitting a dataset that captures its true knowledge. Additionally, we discuss the implications and suitable relaxations of (F) and (T) when the mediator has a limited budget for rewards or lacks a validation set. Our theoretical findings are validated on synthetic and real-world datasets.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
☆ Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.
☆ Sobolev Regularized MMD Gradient Flow
We propose Sobolev-regularized Maximum Mean Discrepancy (SrMMD) gradient flow, a regularized variant of maximum mean discrepancy (MMD) gradient flow based on a gradient penalty on the witness function. The proposed regularization mitigates the non-convexity of the MMD objective and yields provable \emph{global} convergence guarantees in MMD in both continuous and discrete time. A more surprising appeal is that our convergence analysis does not rely on isoperimetric assumptions on the target distribution. Instead, it is based on a regularity condition on the difference between kernel mean embeddings. A key highlight of the proposed flow is that it is applicable in both sampling (from an unnormalized target distribution) -- using Stein kernels -- and generative modeling settings, unlike previous works, where a gradient flow is suitable for only generative modeling or sampling but not both. The effectiveness of the proposed flow is empirically verified on a broad range of tasks in both generative modelling and sampling.
☆ Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning
TD($λ$) in value-based MARL algorithms or the Temporal Difference critic learning in Actor-Critic-based (AC-based) algorithms synergistically integrate elements from Monte-Carlo simulation and Q function bootstrapping via dynamic programming, which effectively addresses the inherent bias-variance trade-off in value estimation. Based on that, some recent works link the adaptive $λ$ value to the policy distribution in the single-agent reinforcement learning area. However, because of the large joint action space from multiple number of agents, and the limited transition data in Multi-agent Reinforcement Learning, the policy distribution is infeasible to be calculated statistically. To solve the policy distribution calculation problem in MARL settings, we employ a parametric likelihood-free density ratio estimator with two replay buffers instead of calculating statistically. The two replay buffers of different sizes store the historical trajectories that represent the data distribution of the past and current policies correspondingly. Based on the estimator, we assign Adaptive TD($λ$), \textbf{ATD($λ$)}, values to state-action pairs based on their likelihood under the stationary distribution of the current policy. We apply the proposed method on two competitive baseline methods, QMIX for value-based algorithms, and MAPPO for AC-based algorithms, over SMAC benchmarks and Gfootball academy scenarios, and demonstrate consistently competitive or superior performance compared to other baseline approaches with static $λ$ values.
☆ LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection
Orthogonal parameter-efficient fine-tuning (PEFT) adapts pretrained weights through structure-preserving multiplicative transformations, but existing methods often conflate two distinct design choices: the subspace in which adaptation occurs and the transformation applied within that subspace. This paper introduces LOFT, a low-rank orthogonal fine-tuning framework that explicitly separates these two components. By viewing orthogonal adaptation as a multiplicative subspace rotation, LOFT provides a unified formulation that recovers representative orthogonal PEFT methods, including coordinate-, butterfly-, Householder-, and principal-subspace-based variants. More importantly, this perspective exposes support selection as a central design axis rather than a byproduct of a particular parameterization. We develop a first-order analysis showing that useful adaptation supports should be informed by the downstream training signal, motivating practical task-aware support selection strategies. Across language understanding, visual transfer, mathematical reasoning, and multilingual out-of-distribution adaptation, LOFT recovers principal-subspace orthogonal adaptation while gradient-informed supports improve the efficiency-performance trade-off under matched parameter, memory, and compute budgets. These results suggest that principled support selection is an important direction for improving orthogonal PEFT.
☆ Information theoretic underpinning of self-supervised learning by clustering
Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. The advances in SSL have been made thanks to vigorous arguments about the principles of SSL and through extensive empirical research. The aim of this paper is to contribute to the development of the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as K-L divergence optimization. The mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that using Jensen inequality this normalization simplifies to the popular batch centering procedure. Distillation and centering are common {heuristics-based} practices in SSL, {but our work underpins them theoretically.} The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
☆ FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
☆ Variance-aware Reward Modeling with Anchor Guidance
Standard Bradley--Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but suffer from a fundamental non-identifiability from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method consistently improves reward modeling performance and downstream RLHF, including PPO training and best-of-$N$ selection.
☆ Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs
Federated fine-tuning of large language models is commonly formulated as a parameter aggregation problem. However, even parameter-efficient methods require transmitting large collections of trainable weights, assume aligned architectures, and rely on white-box access to model parameters. As model sizes continue to grow and deployments become increasingly heterogeneous, these assumptions become progressively misaligned with practical constraints. We consider an alternative formulation in which collaboration is mediated through model behavior rather than parameters. Clients fine-tune local models on private data and exchange generated outputs on a shared, public prompt set. The server maps these outputs into a semantic representation space, forms a per-prompt semantic consensus, and returns pseudo-labels for further local fine-tuning. This formulation fundamentally changes the communication scaling of federated LLM fine-tuning. The amount of information exchanged depends only on the public prompt budget and the size of the communicated behaviors, independent of model size. As a consequence, the protocol naturally accommodates heterogeneous architectures and applies directly to open-ended text generation. We present a theoretical analysis and empirical results demonstrating that this approach can match strong federated fine-tuning baselines while substantially reducing communication by orders of magnitude (e.g., analytically by a factor of $1006$ for Llama3.1-405B), as well as reductions in runtime and energy consumption. These results suggest that, for generative foundation models, behavior-level consensus provides a more appropriate abstraction for federated adaptation than parameter aggregation.
☆ Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications ICML2026
Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.
comment: Accepted as a spotlight at ICML2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9
☆ GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic deviation, while later tokens in the same autoregressive continuation may return to low divergence. GEAR therefore treats such spikes as anchors for adaptive credit regions: where the student remains aligned with the teacher, token-level resolution is preserved; where it departs, GEAR groups the corresponding continuation into an adaptive segment and uses the divergence at the departure point to modulate the segment' s advantage. Experiments across eight mathematical reasoning and agentic tool-use benchmarks with Qwen3 4B and 8B models show that GEAR consistently outperforms standard GRPO, self-distillation-only baselines, and token- or turn-level credit-assignment methods. The gains are especially strong on benchmarks with lower GRPO baseline accuracy, reaching up to around 20\% over GRPO, suggesting that the proposed adaptive reweighting scheme is especially useful in more challenging long-horizon settings.
☆ Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives
In this work, we develop proximal preconditioned gradient methods with a focus on spectral gradient methods providing a proximal extension to the Muon and Scion optimizers. We introduce a family of stochastic algorithms that can handle a wide variety of convex and nonconvex constraints and study its convergence under heavy-tailed noise, through a novel analysis tailored to the geometry of the proposed methods. We further propose a variance-reduced version, which achieves faster convergence under standard noise assumptions. Finally, we show that the polynomial iterations used in Muon are more accurately captured by a nonlinear preconditioner than by the ideal matrix sign, leading to a convergence analysis that more faithfully reflects practical implementations.
☆ A Fast and Energy-Efficient Latch-Based Memristive Analog Content-Addressable Memory IEEE
Analog content-addressable memories (aCAMs) based on memristors provide a promising pathway toward energy-efficient large-scale associative computing for Edge AI and embedded intelligence applications. They have been successfully applied to decision-tree inference and extend the capabilities of compute-in-memory (CIM) architectures beyond conventional vector-matrix multiplication. However, conventional designs such as the 6T2M architecture suffer from static search power, limited voltage gain, and pronounced match-line crosstalk, constraining analog precision and scalability. We introduce a strong-arm latched memristor (SALM) aCAM cell that replaces static voltage division with a dynamic current-race comparator, enabling high regenerative gain, intrinsic result latching, and near-zero static search power. Compared to 6T2M, SALM reduces read energy by 33% at identical latency while eliminating the gain and crosstalk limitations that prevent 6T2M from scaling to large arrays. SALM further enables scalable sequential and parallel latch sharing, and a dataset-aware optimization framework exposes an explicit energy-latency tradeoff, achieving up to 50% energy reduction at 3x latency across representative workloads. To enable architectural exploration, we develop a circuit-accurate behavioral model derived from SPICE lookup tables in 22 nm FD-SOI technology, capturing match-line dynamics and crosstalk. Integrated into the X-TIME decision-tree compiler, this framework demonstrates that SALM maintains near-software accuracy for high-dimensional datasets, whereas baseline designs degrade due to limited gain and cumulative crosstalk.
comment: This work has been submitted to the IEEE for possible publication
☆ Martingale-Consistent Self-Supervised Learning
Self-supervised learning (SSL) is often deployed under changing information, such as shorter histories, missing features, or partially observed images. In these settings, predictions from coarse and refined views should be coherent: before refinement, the coarse-view prediction should match the average prediction expected after refinement. Martingales formalize this coherence principle, but standard SSL objectives do not enforce it. Unlike invariance objectives that pull views together, martingale consistency constrains only the expected refined prediction, allowing predictions to update as information is revealed while preventing systematic drift. We introduce a martingale-consistent SSL framework that closes this gap, with practical prediction- and latent-space variants and an unbiased two-sample Monte Carlo estimator based on stochastic refinement. We evaluate the approach on synthetic and real time-series, tabular, and image benchmarks under partial-observation regimes, in both semi-self-supervised and fully label-free settings. Across these experiments, our framework improves robustness and calibration under partial observation, yielding more stable representations as information is revealed.
☆ Minimax Rates and Spectral Distillation for Tree Ensembles
Tree ensembles such as random forests (RFs) and gradient boosting machines (GBMs) are among the most widely used supervised learners, yet their theoretical properties remain incompletely understood. We adopt a spectral perspective on these algorithms, with two main contributions. First, we derive minimax-optimal convergence for RF regression, showing that, under mild regularity conditions on tree growth, the eigenvalue decay of the induced kernel operator governs the statistical rate. Second, we exploit this spectral viewpoint to develop compression schemes for tree ensembles. For RFs, leading eigenfunctions of the kernel operator capture the dominant predictive directions; for GBMs, leading singular vectors of the smoother matrix play an analogous role. Learning nonlinear maps for these spectral representations yields distilled models that are orders of magnitude smaller than the originals while maintaining competitive predictive performance. Our methods compare favorably to state of the art algorithms for forest pruning and rule extraction, with applications to resource constrained computing.
comment: 9 pages main text, 33 pages total, with 12 figures and 7 tables total
☆ Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2α}{3α- 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
☆ More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing
Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE-USTC/StableEdit.
☆ Multi-Timescale Conductance Spiking Networks: A Sparse, Gradient-Trainable Framework with Rich Firing Dynamics for Enhanced Temporal Processing IEEE
Spiking neural networks (SNNs) promise low-power event-driven computation for temporally rich tasks, but commonly used neuron models often trade off gradient-based trainability, dynamical richness, and high activity sparsity. These limitations are acute in regression, where approximation error, noise and spike discretization can severely degrade continuous-valued outputs. Indeed, many state-of-the-art (SOTA) SNNs rely on simple phenomenological dynamics trained with surrogate gradients and offer limited control over spiking diversity and sparsity. To overcome such limitations, we introduce multi-timescale conductance spiking networks, a gradient-trainable framework in which neural dynamics emerge from shaping the current-voltage (I-V) curve by tuning fast, slow and ultra-slow conductances. This parametrization allows systematic control over excitability, can be implemented efficiently in analog circuits, and yields rich firing regimes including tonic, phasic and bursting responses within a single model. We derive a discrete-time formulation of these differentiable dynamics, enabling direct backpropagation through time without surrogate-gradient approximations. To probe both trainability and accuracy, we evaluate feedforward networks of these neurons at the predictability limit of Mackey-Glass time-series regression and compare them to baseline LIF and SOTA AdLIF networks. Our model outperforms LIF and AdLIF networks, while exhibiting substantially sparser activity from both communication and computational perspectives. These results highlight multi-timescale conductance spiking neurons as a promising building block for energy-aware temporal processing and neuromorphic implementation.
comment: Published in 2026 IEEE Neuro-Inspired Computational Elements Conference (Atlanta, USA)
☆ Bin Latent Transformer (BiLT): A shift-invariant autoencoder for calibration-free spectral unmixing of turbid media
The accurate recovery of constituent-level optical properties from integrating sphere measurements is a central analytical challenge in pharmaceutical analysis, food science, and biomedical diagnostics. Neural network autoencoders can extract spectrally resolved absorption and scattering coefficients for each constituent without prior knowledge, but their fully connected encoders bind learned features to absolute wavelength indices, causing accuracy loss under spectrometer calibration drift or hardware exchange. This work introduces the Bin Latent Transformer (BiLT)-Autoencoder, in which the dense encoder is replaced by a cross-attention scanner: 16 learnable probe vectors query a convolutional feature map, aggregating morphological spectral information independently of absolute wavelength position. A physics-constrained linear decoder with enforced absorption/scattering separation and a three-phase curriculum augmentation strategy complete the architecture. On a liquid phantom benchmark (intralipid and two ink absorbers; 496 samples), the model achieves $R^2 = 0.979$ and $0.975$ for $μ_a(λ)$ and $μ_s'(λ)$, respectively, on held-out test spectra, maintaining $R^2 > 0.90$ for $μ_a$ and $R^2 \approx 0.99$ for $μ_s'$ across the full tested shift range of $\pm 10$ spectral bands. The model generalises to a simulated spectrometer with a broader instrument line shape (${\approx}24$nm FWHM) without retraining, retaining $R^2 \approx 0.96$ and $0.974$ for the two channels. Attention map analysis reveals a physically interpretable two-component probe strategy: sparse anchor probes at absorption-edge wavelengths combined with a diffuse, SNR-driven ensemble at the high-transmittance long-wavelength region, which recruits additional probes dynamically under noise to provide implicit spectral averaging.
☆ Fed-BAC: Federated Bandit-Guided Additive Clustering in Hierarchical Federated Learning IEEE
Hierarchical federated learning (HFL) leverages edge servers for partial aggregation in edge computing. Yet existing FL methods lack mechanisms for jointly optimizing cluster assignment and client selection under data heterogeneity. This paper proposes Fed-BAC, which integrates additive cluster personalization with a two-level bandit framework: contextual bandits at the cloud learn server-to-cluster assignments, while Thompson Sampling at each edge server identifies high-contributing clients. The additive decomposition enables the sharing of knowledge between groups through a globally aggregated network, while cluster-specific networks capture distribution variations. Across three classification benchmarks (CIFAR-10, SVHN, Fashion-MNIST) under moderate ($α= 0.5$) and severe ($α= 0.1$) Dirichlet non-IID partitioning, Fed-BAC achieves distributed accuracy gains of up to +35.5pp over HierFAVG and +8.4pp over IFCA, while requiring only 80% client participation, converging 1.5 to 4.8$\times$ faster depending on dataset and accuracy target, and improving cross-server fairness. These gains are further validated at 5$\times$ deployment scale on CIFAR-10. The advantage of Fed-BAC increases with heterogeneity severity, confirming that additive cluster personalization becomes increasingly valuable as data distributions diverge.
comment: 9 pages, 5 figures. Accepted at the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS 2026), Valencia, Spain, June 9-12, 2026. To appear in IEEE proceedings
☆ Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning
Data-free continual learning (DFCIL) relies on model inversion to synthesize pseudo-samples and mitigate catastrophic forgetting. However, existing inversion methods are fundamentally limited by a simplifying assumption: they model feature distributions using diagonal covariance, effectively ignoring correlations that define the geometry of learned representations. As a result, synthesized samples often lack fidelity, limiting knowledge retention. In this work, we show that modeling feature dependencies is a key ingredient for effective DFCIL. We introduce REMIX, a structured covariance modeling framework that enables scalable full-covariance modeling without the prohibitive cost of dense matrix inversion and log-determinant computation. By leveraging a Laplace kernel parameterization, REMIX captures structured feature dependencies using memory that scales linearly with the feature dimensionality, while requiring only an additional logarithmic factor in computation. Modeling these correlations produces more coherent synthetic samples and consistently improves performance across standard DFCIL benchmarks. Our results demonstrate that moving beyond diagonal assumptions is essential for effective and scalable data-free continual learning. Our code is available at https://github. com/pkrukowski1/REMIX-Model-Inversion-via-Laplace-Kernel.
☆ ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems
Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and its negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6\%, 58.8\%, and 59.8\% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.
comment: 11 pages, 5 figures, 4 tables
☆ Crash Assessment via Mesh-Based Graph Neural Networks and Physics-Aware Attention
Full-vehicle crash simulations are computationally expensive, limiting their use in iterative design exploration. This work investigates learned hybrid surrogate models (MeshTransolver, MeshGeoTransolver, and MeshGeoFLARE) for predicting time-resolved structural deformation fields in an industrial lateral pole-impact benchmark. We evaluate whether neural surrogates can reproduce full-field crash kinematics with sufficient accuracy, spatial regularity, and structural plausibility for engineering interpretation. The proposed architectures combine local mesh message passing, geometry-aware global attention, and sparse contact-aware correction for autoregressive crash rollout. We compare mesh-based graph neural networks, attention-based geometric models, and hybrid architectures under a common training and hyperparameter configuration. The hybrid models capture both short-range structural interactions and long-range deformation patterns, while a sparse contact-aware variant assesses the effect of dynamic proximity interactions during rollout. On a 25-sample full-vehicle test set, the best hybrid model achieves a temporal mean root-mean-square error of 3.20 mm. While geometry-aware attention baselines are quantitatively competitive, qualitative side-view inspection shows they can introduce local spatial noise and deformation irregularities that complicate structural interpretation. In contrast, hybrid mesh-attention models provide the best balance between scalar accuracy, survival-space consistency, and physically interpretable displacement fields. These results suggest that crash surrogate assessment should combine global error metrics with downstream safety-relevant quantities and qualitative field inspection. The proposed methodology enables fast full-field predictions while preserving essential structural information for industrial crash-engineering analysis.
comment: 40 pages, 15 figures, 7 tables
☆ Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level mechanism by which sampled policy updates reshape policy entropy remains underexplored. In this work, we develop a theoretical framework of entropy mechanics in RLVR. Our analysis yields a first-order approximation of the entropy change, giving rise to entropy polarity, a signed token-level quantity that predicts how much a sampled update expands or contracts entropy. This analysis further reveals a structural asymmetry: reinforcing frequent high-probability tokens triggers contraction tendencies, whereas expansive tendencies typically require lower-probability samples or stronger distributional correction. Empirically, we show that entropy polarity reliably predicts entropy changes, and that positive and negative polarity branches play complementary roles in preserving exploration while strengthening exploitation. Building on these insights, we propose Polarity-Aware Policy Optimization (PAPO), which preserves both polarity branches and implements entropy control through advantage reweighting. With the empirical entropy trajectory as an online phase signal, PAPO adaptively reallocates optimization pressure between entropy-expanding and entropy-contracting updates. Experiments on mathematical reasoning and agentic benchmarks show that PAPO consistently outperforms competitive baselines, while delivering superior training efficiency and substantial reward improvements.
☆ From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly frequent EHRs often yield excessively long token sequences that result in high computational costs and even reduced performance. Existing solutions either add modules for compression or remove less important tokens, which introduce additional inference latency or risk losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens of merely 0.5-1.0% of the LLM's parameters are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
comment: 21 pages, 6 figures, 13 tables
☆ Is Monotonic Sampling Necessary in Diffusion Models?
Diffusion models generate samples by iteratively denoising a Gaussian prior, traversing a sequence of noise levels that, in every published sampler, decreases monotonically. Six years of intensive work has refined nearly every aspect of this recipe, including the corruption operator, the training objective, the schedule shape, the architecture, and the ODE solver. Yet the assumption of monotonicity itself has never been systematically tested. Here we ask whether monotonic sampling is load-bearing or merely conventional. We design four families of structured nonmonotonic schedules and apply them to three architecturally distinct generative models, DDPM, EDM, and Flow Matching, across NFE budgets ranging from 10 to 200 function evaluations, plus a 42-cell hyperparameter ablation, on CIFAR-10. Across all 90 tested configurations, no tested nonmonotonic schedule improves on the monotonic baseline. The magnitude of the penalty, however, spans nearly three orders of magnitude: persistent and substantial in DDPM, intermediate in Flow Matching, and indistinguishable from zero in EDM. We show that this variation is not noise but a structural property of each trained denoiser, and we formalize it as the Schedule Sensitivity Coefficient, a cheap, architecture-agnostic diagnostic that provides evidence of non-convergence to the Bayes-optimal denoiser at the critical noise level. Our findings justify the field's tacit reliance on monotonic schedules and supply a new probe of diffusion model quality complementary to sample-quality metrics such as Frechet Inception Distance.
☆ Decomposing the Generalization Gap in PROTAC Activity Prediction: Variance Attribution and the Inter-Laboratory Ceiling
Machine-learning predictors of biochemical activity often exhibit large random-split-to-leave-one-target-out generalisation gaps that have been documented but not decomposed. We frame this as an evaluation-science question and use targeted protein degradation as the empirical test bed. PROTACs (proteolysis-targeting chimeras) are heterobifunctional small molecules that induce targeted protein degradation, with more than forty candidates currently in clinical trials; published predictors report AUROC of 0.85 to 0.91 under random-split cross-validation, while the leave-one-target-out (LOTO) protocol of Ribes et al. reduces performance to approximately 0.67. Random splits reward within-target interpolation, whereas LOTO measures the novel-target prediction that de-novo design depends on. We decompose this gap and identify inter-laboratory measurement variance as the dominant component, anchored by a within-target cross-laboratory cascade bounding the inter-laboratory contribution at 0.124 AUROC, well above the 0.05 contribution from binarisation-threshold choice. Across eight published architectures and ESM-2 protein language models up to 3B parameters, LOTO AUROC plateaus near 0.67, with a comparable plateau under SMILES-level deduplication; a 21-dimensional 2000-trial hyperparameter optimisation cannot break this ceiling, and the rank-1 single-seed configuration regresses by 0.161 AUROC under multi-seed validation, matching a closed-form selection-bias prediction (Bailey and Lopez de Prado, 2014). Few-shot k=5 stratified per-target retraining combined with ADMET features lifts 65-target LOTO AUROC from 0.668 to 0.7050, and post-hoc Platt scaling recovers raw output to within the 0.05 well-calibrated threshold. We release PROTAC-Bench (10,748 measurements, 173 targets, 65 LOTO folds), the variance-decomposition framework, the per-target calibration protocol, and the evaluation code.
comment: 32 pages, 11 figures, 11 tables. Dataset: https://huggingface.co/datasets/ThorKl/protac-bench (CC-BY-4.0). Code: https://github.com/ThorKlm/PROTAC-Bench (MIT)
☆ A nonlinear extension of parametric model embedding for dimensionality reduction in parametric shape design
Dimensionality reduction is essential in simulation-based shape design, where high-dimensional parameterizations hinder optimization, surrogate modeling, and systematic design-space exploration. Parametric Model Embedding (PME) addresses this issue by constructing reduced variables from geometric information while preserving an explicit backmapping to the original design parameters. However, PME is intrinsically linear and may become inefficient when the sampled design space is governed by nonlinear geometric variability. This paper introduces a nonlinear extension of PME, denoted NLPME. The proposed framework preserves the defining principle of PME -- geometry-driven latent variables and parameter-mediated reconstruction -- while replacing the linear reduced subspace with a nonlinear latent representation. Geometry is not reconstructed directly from the latent variables; instead, the latent representation is decoded into admissible design parameters, and the corresponding geometry is recovered through a forward parametric map. The method is assessed on a bio-inspired autonomous underwater glider with a 32-dimensional parametric shape description and a CAD-based geometry-generation process. NLPME reaches a 5\% reconstruction-error threshold with \(N=5\) latent variables, compared with \(N=8\) for linear PME, and a 1\% threshold with \(N=9\), compared with \(N=15\) for PME. Comparison with a deep autoencoder shows that most of the nonlinear compression gain can be retained while preserving an explicit backmapping to the original design variables. The results establish NLPME as a compact, admissible, and engineering-compatible nonlinear reduced representation for parametric shape design spaces.
☆ One-Step Generative Modeling via Wasserstein Gradient Flows
Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256$\times$256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100$\times$ faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.
comment: 38 pages, 14 figures
☆ Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention
Federated learning relies on effective client selection to alleviate the performance degradation caused by data heterogeneity. Most existing methods assume full visibility of all clients at each communication round. However, in large-scale or edge-based deployments, the server can only access a subset of clients due to communication, mobility, or availability constraints, resulting in partial visibility where only a subset of clients is observable for aggregation in each communication round. In this paper, we formulate federated client selection under partial visibility as a Partially Observable Markov Decision Process (POMDP) and propose a Spatial-Temporal attention-based reinforcement learning framework. By integrating historical global models and client identity embeddings, the proposed method captures both the temporal contexts of training and the persistent characteristics of clients. Experimental results across multiple datasets demonstrate that our approach achieves superior performance compared to existing baselines in heterogeneous and partially visible settings, validating its effectiveness in addressing the challenges of incomplete observations in practical federated learning systems.
☆ Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection IEEE
Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors significantly differ from normal ones, given only a limited number of annotated anomalies and abundant unlabeled samples. A major challenge is to learn a meaningful latent feature representation that reduces intra-class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self-supervised feature learning for graph anomaly detection, their strategies are not specifically tailored to its unique requirements, motivating our exploration of a more domain-specific approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy tailored for graph anomalies. Our approach is built upon a multi-task learning scheme that extracts robust feature representations through synthesized anomalies. We generate synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type, ensuring that learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real-world patterns, they provide valuable auxiliary data for effective feature learnin, much like features learned from ImageNet classification transfer to downstream vision tasks. Additionally, we adopt a two-phase learning strategy: an initial warm-up phase using only synthetic samples, followed by a full-training phase integrating both tasks, to balance the influence of synthetic and real data. Extensive experiments on public datasets demonstrate the superior performance of our method over its competitors. Code is available at https://github.com/yj-zhou/SAWGAD.
comment: 14 pages, 7 figures, published by IEEE Transactions on Knowledge and Data Engineering,2026
☆ Training-Inference Consistent Segmented Execution for Long-Context LLMs ICML 2026
Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).
comment: Accepted by ICML 2026. 19 pages, 6 figures, 3 tables
☆ WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views ICML2026
Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.
comment: Accepted as a regular paper at ICML2026
☆ Online Continual Learning with Dynamic Label Hierarchies ICML2026
Online Continual Learning (OCL) aims to learn from endless non\text{-}stationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real\text{-}world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.
comment: Accepted to ICML2026
☆ U-STS-LLM A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation
The efficient operation of modern cellular networks hinges on the accurate analysis of spatio-temporal traffic data. Mastering these patterns is essential for core network functions, chiefly forecasting future load to pre-empt congestion and imputing missing values caused by sensor failures or transmission errors to ensure data continuity. While deeply connected, forecasting and imputation have historically evolved as separate sub-fields. The dominant paradigm, Spatio-Temporal Graph Neural Networks (STGNNs), while effective, are often specialized, computationally intensive, and exhibit limited generalization. Concurrently, adapting large pre-trained language models (LLMs) offers a powerful alternative for sequence modeling, yet existing approaches provide weak structural guidance, leading to unstable convergence and a narrow focus on forecasting. To bridge these gaps, we propose U-STS-LLM, a unified framework built on a spatio-temporally steered LLM. Our core innovation is a Dynamic Spatio-Temporal Attention Bias Generator that synthesizes a persistent functional graph with transient nodal states to explicitly steer the LLM's attention. Coupled with a partially frozen backbone tuned via Low-Rank Adaptation (LoRA) and a Gated Adaptive Fusion mechanism, the model achieves stable, parameter-efficient adaptation. Trained under a unified multi-task objective, U-STS-LLM learns a holistic data representation. Extensive experiments on real-world cellular datasets demonstrate that U-STS-LLM establishes new state-of-the-art performance in both long-horizon forecasting and high-missing-rate imputation, while maintaining remarkable training efficiency and stability, offering a novel blueprint for harnessing foundation models in structured, non-linguistic domains.
comment: 14 pages, 6 figures
☆ Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation
Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57\% to 97\% while producing 2-6$\times$ more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 $\rightarrow$ 0.99, F1: 0.53 $\rightarrow$ 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.
☆ Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models NeurIPS 2026
Recently, reinforcement learning (RL) has been widely applied during post-training for diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the parallel decoding granularity and affects the rollout trajectories during RL optimisation, e.g., GRPO. Instead of investigating the effect of block size during inference on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which will largely affect the post-training effectiveness for rollout-based RL methods; (2) a novel dataset, Block-R1-41K is constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure the domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training for dLLMs in both single and cross domain; and (4) a simple yet powerful cross-domain post-training method with sample-level best-improved training block sizes. Extensive experiments on 13 distinct datasets, 7 latest RL algorithms, and various different dLLM backbones are covered in Block-R1. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1, with the dataset released at https://huggingface.co/datasets/dLLM-R1/Block-R1-41K.
comment: NeurIPS 2026 Preprint
☆ EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
☆ Debiased Model-based Representations for Sample-efficient Continuous Control ICML 2026
Model-based representations recently stand out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, tagged DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state besides minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
comment: ICML 2026
☆ Unlocking Compositional Generalization in Continual Few-Shot Learning
Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.
comment: 10 pages
☆ GRAFT: Graph-Tokenized LLMs for Tool Planning
Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
☆ Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ and the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with comparable guarantees as prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
☆ Compositional Neural Operators for Multi-Dimensional Fluid Dynamics ICLR 2026
Partial differential equations (PDEs) govern diverse physical phenomena, yet high-fidelity numerical solutions are computationally expensive and Machine Learning approaches lack generalization. While Scientific Foundation Models (SFMs) aim to provide universal surrogates, typical encoding-decoding approaches suffer from high pretraining costs and limited interpretability. In this paper, we propose Compositional Neural Operators (CompNO) for 2D systems, a framework that decomposes complex PDEs into a library of Foundation Blocks. Each block is a specialized Neural Operator pretrained on elementary physics. This modular library contains convection, diffusion, and nonlinear convection blocks as well as a Poisson Solver, enabling the framework to address the pressure-velocity coupling. These experts are assembled via an Adaptation Block featuring an Aggregator. This aggregator learns nonlinear interactions by minimizing data loss and physics-based residuals driven from governing equations. The proposed approach has been evaluated on the Convection-Diffusion equation, the Burgers' equation, and the Incompressible Navier-Stokes equation. Our results demonstrate that learning from elementary operators significantly improves adaptability, enhances model interpretability and facilitates the reuse of pretrained blocks when adapting to new physical systems.
comment: Published as a conference paper at ICLR 2026
☆ Slicing and Dicing: Configuring Optimal Mixtures of Experts
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128.Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.
☆ Shaping Zero-Shot Coordination via State Blocking
Zero-shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real-world multi-agent systems and human-AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State-Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero-shot coordination, including strong generalization to human partners.
comment: 9 technical page followed by references and appendix
☆ Partial Model Sharing Improves Byzantine Resilience in Federated Conformal Prediction
We propose a Byzantine-resilient federated conformal prediction (FCP) method that leverages partial model sharing, where only a subset of model parameters is exchanged each round. Unlike existing robust FCP approaches that primarily harden the calibration stage, our method protects both the federated training and conformal calibration phases. During training, partial sharing inherently restricts the attack surface and attenuates poisoned updates while reducing communication. During calibration, clients compress their non-conformity scores into histogram-based characterization vectors, enabling the server to detect Byzantine clients via distance-based maliciousness scores and to estimate the conformal quantile using only benign contributors. Experiments across diverse Byzantine attack scenarios show that the proposed method achieves closer-to-nominal coverage with substantially tighter prediction intervals than standard FCP, establishing a robust and communication-efficient approach to federated uncertainty quantification.
comment: 5 pages, 4 figures, Accepted for presentation at the 34th European Signal Processing Conference (EUSIPCO 2026) in Bruges, Belgium
☆ Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
The reasoning frontier of Large Language Models (LLMs) has advanced significantly through modern post-training paradigms (e.g., Reinforcement Learning from Verifiable Rewards (RLVR)). However, the efficacy of these methods remains fundamentally constrained by the diversity and complexity of the training data. One practical solution is data synthesis; yet, prevalent methods relying on unstructured mutation or exploration suffer from homogeneity collapse, failing to systematically expand the reasoning frontier. To overcome this, we propose Evoutionary Task Discovery (EvoTD), a framework that treats data synthesis as a directed search over a dual-axis manifold of Algorithmic Skills and Complexity Attributes. We introduce structured evolutionary operators to navigate this space: a Crossover operator that synthesizes novel skill compositions to enhance diversity, and a Parametric Mutation operator that scales structural constraints (e.g., input size, tree depth) to drive robust generalization. Crucially, we integrate a dynamic Zone of Proximal Development filter, ensuring tasks lie within the learnable region of the model. Empirically, EvoTD delivers substantial reasoning gains that generalize consistently across model architectures, pretraining regimes, and scales, demonstrating that structured evolutionary curricula can effectively support reasoning improvement. We release our code on https://github.com/liqinye/EvoTD.
☆ Posterior Contraction Rates for Sparse Kolmogorov-Arnold Networks in Anisotropic Besov Spaces
We study posterior contraction rates for sparse Bayesian Kolmogorov-Arnold networks (KANs) over anisotropic Besov spaces, providing a statistical foundation of KANs from a Bayesian point of view. We show that sparse Bayesian KANs equipped with spike-and-slab-type sparsity priors attain the near-minimax posterior contraction. In particular, the contraction rate depends on the intrinsic anisotropic smoothness of the underlying function. Moreover, by placing a hyperprior on a single model-size parameter, the resulting posterior adapts to unknown anisotropic smoothness and still achieves the corresponding near-minimax rate. A distinctive feature of our results, compared with those for standard sparse MLP-based models, is that the KAN depth can be kept fixed: owing to the flexibility of learnable spline edge functions, the required approximation complexity is controlled through the network width, spline-grid range and size, and parameter sparsity. Our analysis develops theoretical tools tailored to sparse spline-edge architectures, including approximation and complexity bounds for Bayesian KANs. We then extend to compositional Besov spaces and show that the contraction rates depend on layerwise smoothness and effective dimension of the underlying compositional structure, thereby effectively avoiding the curse of dimensionality. Together, the developed tools and findings advance the theoretical understanding of Bayesian neural networks and provide rigorous statistical foundations for KANs.
☆ GeomHerd: A Forward-looking Herding Quantification via Ricci Flow Geometry on Agent Interactive Simulations
Herding -- where agents align their behaviors and act collectively -- is a central driver of market fragility and systemic risk. Existing approaches to quantify herding rely on price-correlation statistics, which inherently lag because they only detect coordination after it has already moved realised returns. We propose GeomHerd, a forward-looking geometric framework that bypasses this observability lag by quantifying coordination directly on upstream agent-interaction graphs. To generate these graphs, we treat a heterogeneous LLM-driven multi-agent simulator -- each financial trader instantiated by a persona-conditioned LLM call -- as a forecastable world, and evaluate the geometric pipeline on the Cividino--Sornette continuous-spin agent-based substrate as our headline financial testbed. By tracking the discrete Ollivier--Ricci curvature of these action graphs, GeomHerd captures the structural topology of emerging coordination. Theoretically, we establish a mean-field bridge mapping our graph-theoretic metric to CSAD, the classical macroscopic herding statistic, linking GeomHerd to downstream price-dispersion measurement. Empirically, GeomHerd anticipates herding long before aggregate market baselines: on the continuous-spin substrate, our primary detector fires a median of 272 steps before order-parameter onset; a contagion detector ($β_{-}$) recalls 65% of critical trajectories 318 steps early; and on co-firing trajectories the agent-graph signal precedes price-correlation-graph baselines by 40 steps. As a complementary indicator, the effective vocabulary of agent actions contracts during cascades. The geometric signature transfers out-of-domain to the Vicsek self-driven-particle model, and a curvature-conditioned forecasting head reduces cascade-window log-return MAE over detector-conditioned and price-only baselines.
☆ Finite Sentence-Interface Control for Learning Bounded-Fan-Out Linear MCFGs under Fixed Monoid Typing
We study positive-data learning of bounded-fan-out linear multiple context-free grammars under a fixed explicit finite monoid homomorphism \(h\). The main obstacle beyond the context-free case is that an MCFG nonterminal derives a tuple whose components may be placed in a surrounding sentence in different orders. We introduce sentence-interface types as finite external control objects for such tuple occurrences. A type records the permutation of tuple components in the final sentence together with the \(h\)-values of the boundary intervals between them. For reduced working binary linear nondeleting MCFG presentations whose string languages satisfy \((f,h)\)-tuple substitutability, we build a typed refinement, a finite characteristic sample, and a canonical positive-data learner. Once the sample contains this characteristic sample and remains contained in the target language, the learner reconstructs the language exactly. Consequently, for fixed fan-out bound \(f\) and fixed explicit \(h\), the resulting class is identifiable in the limit from positive data. Moreover, the hypothesis associated with any given finite sample is constructible in polynomial time for fixed \(f\) and fixed \(h\), including output size. Thus sentence-interface control is the finite mechanism that lifts fixed-\(h\) distributional reconstruction from context-free grammars to bounded-fan-out linear MCFGs.
☆ Learning U-Statistics with Active Inference
$U$-statistics play a central role in statistical inference. In many modern applications, however, acquiring the labels required for $U$-statistics is costly. Motivated by recent advances in active inference, we develop an active inference framework for $U$-statistics that selectively queries informative labels to improve estimation efficiency under a fixed labeling budget, while preserving valid statistical inference. Our approach is built on the augmented inverse probability weighting $U$-statistic, which is designed to incorporate the sampling rule and machine learning predictions. We characterize the optimal sampling rule that minimizes its variance and design practical sampling strategies. We further extend the framework to $U$-statistic-based empirical risk minimization. Experiments on real datasets demonstrate substantial gains in estimation efficiency over baseline methods, while maintaining target coverage.
☆ MIST: Reliable Streaming Decision Trees for Online Class-Incremental Learning via McDiarmid Bound
Streaming decision trees are natural candidates for open-world continual learning, as they perform local updates, enjoy bounded memory, and static decision boundaries. Despite these, they still fail in online class-incremental learning due to two coupled miscalibrations: (i) their split criterion grows unreliable as the class count K expands, and (ii) the absence of knowledge transfer at split time. Both failures share a common root: the range of Information Gain intrinsically scales with log2 K. Consequently, any Hoeffding-style confidence radius derived from it must inevitably grow with the class count, making a K-independent split criterion structurally impossible, taking away the potential benefits of applying streaming decision trees to continual learning. To fix this issue, we present MIST (McDiarmid Incremental Streaming Tree), which resolves both failures through three integrated components: (i) a tight, K-independent McDiarmid confidence radius for Gini splitting that acts as a structural regulariser; (ii) a Bayesian inheritance protocol that projects parent statistics to child nodes via truncated-Gaussian moments, with variance reduction guarantees strongest precisely when splitting is most conservative; and (iii) per-leaf KLL quantile sketches that support both continuous threshold evaluation and geometry-adaptive leaf prediction from a single data structure. On standard and stress-test tabular streams, MIST is competitive with global parametric methods on near-Gaussian benchmarks and uniquely robust on non-Gaussian geometry where SOTA benchmarks collapse.
comment: 9 pages of main text, 5 figures
☆ From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
On-policy self-distillation has emerged as a promising paradigm for post-training language models, in which the model conditions on environment feedback to serve as its own teacher, providing dense token-level rewards without external teacher models or step-level annotations. Despite its empirical success, what this reward actually measures and what kind of credit it assigns remain unclear. Under a posterior-compatibility interpretation of feedback conditioning, standard in the implicit-reward literature, we show that the self-distillation token reward is a Bayesian filtering increment whose trajectory sum is exactly the pointwise mutual information between the response and the feedback given the input. This pMI can be raised by input-specific reasoning or by input-generic shortcuts, so we further decompose the teacher log-probability along the input axis. Based on this analysis, we propose CREDIT (Contrastive REward from DIsTillation), which isolates the input-specific component with a batch-contrastive baseline. At the sequence level, CREDIT is a teacher-side surrogate for a contrastive pMI objective that also penalizes responses remaining likely under unrelated inputs. Across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT delivers the strongest aggregate performance at negligible additional compute.
☆ Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
☆ PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
☆ Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty
Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable latent factors and calibrated uncertainty. Building on the identifiable parameterization of Bouhaddani et al.\ (2018), existing fitting pipelines still face two practical bottlenecks: noise--signal coupling under joint EM/ECM updates and nontrivial handling of orthogonality constraints. Following the fixed-noise scalar-likelihood line of Hu et al.\ (2025), we develop an end-to-end framework that combines noise pre-estimation, constrained likelihood optimization, and prediction calibration in one pipeline. Relative to Hu et al.\ (2025), we replace full-spectrum noise averaging with noise-subspace estimation and replace interior-point penalty handling with exact Stiefel-manifold optimization. The noise-subspace estimator attains a signal-strength-independent leading finite-sample rate and matches a minimax lower bound, while the full-spectrum estimator is shown to be inconsistent under the same model. We further extend the framework to sub-Gaussian settings via optional Gaussianization and provide closed-form standard errors through a block-structured Fisher analysis. Across synthetic high-noise settings and two multi-omics benchmarks (TCGA-BRCA and PBMC CITE-seq), the method achieves near-nominal coverage without post-hoc recalibration, reaches Ridge-level point accuracy on TCGA-BRCA at rank $r=3$, matches or exceeds PO2PLS on cross-view prediction while providing native calibrated uncertainty, and improves stability of parameter recovery.
☆ Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
comment: 17 pages, 1 figure
☆ EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting
The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).
☆ SoK: Unlearnability and Unlearning for Model Dememorization
Advanced model dememorization methods, including availability poisoning (unlearnability) and machine unlearning, are emerging as key safeguards against data misuse in machine learning (ML). At the training stage, unlearnability embeds imperceptible perturbations into data before release to reduce learnability. At the post-training stage, unlearning removes previously acquired information from models to prevent unauthorized disclosure or use. While both defenses aim to preserve the right to withhold knowledge, their vulnerabilities and shared foundations remain unclear. Specifically, both unlearnability and unlearning suffer from issues such as shallow dememorization, leading to falsely claimed data learnability reduction or forgetting in the presence of weight perturbations. Moreover, input perturbations may affect the effectiveness of downstream unlearning, while unlearning may inadvertently recover domain knowledge hidden by unlearnability. This interplay calls for deeper investigation. Finally, there is a lack of formal guarantees to provide theoretical insights into current defenses against shallow dememorization. In this Systematization of Knowledge, we present the first integrated analysis of model dememorization approaches leveraging unlearnability and unlearning. Our contributions are threefold: (i) a unified taxonomy of unlearnability and scalable unlearning methods; (ii) an empirical evaluation revealing the robustness, interplay, and shallow dememorization of leading methods; and (iii) the first theoretical guarantee on dememorization depth for models processed through certified unlearning. These results lay the foundation for unifying dememorization mechanisms across the ML lifecycle to achieve a deeper immemor state for sensitive knowledge.
comment: The first two authors contributed equally
☆ Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret
We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.
☆ A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods
This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.
☆ NexOP: Joint Optimization of NEX-Aware k-space Sampling and Image Reconstruction for Low-Field MRI
Modern low-field magnetic resonance imaging (MRI) technology offers a compelling alternative to standard high-field MRI, with portable, low-cost systems. However, its clinical utility is limited by a low Signal-to-Noise Ratio (SNR), which hampers diagnostic image quality. A common approach to increase SNR is through repetitive signal acquisitions, known as NEX, but this results in excessively long scan durations. Although recent work has introduced methods to accelerate MRI scans through k-space sampling optimization, the NEX dimension remains unexploited; typically, a single sampling mask is used across all repetitions. Here we introduce NexOP, a deep-learning framework for joint optimization of the sampling and reconstruction in multi-NEX acquisitions, tailored for low-SNR settings. NexOP enables optimizing the sampling density probabilities across the extended k-space-NEX domain, under a fixed sampling-budget constraint, and introduces a new deep-learning architecture for reconstructing a single high-SNR image from multiple low-SNR measurements. Experiments with raw low-field (0.3T) brain data demonstrate that NexOP consistently outperforms competing methods, both quantitatively and qualitatively, across diverse acceleration factors and tissue contrasts. The results also demonstrate that NexOP yields non-uniform sampling strategies, with progressively decreasing sampling across repetitions, hence exploiting the NEX dimension efficiently. Moreover, we present a theoretical analysis supporting these numerical observations. Overall, this work proposes a sampling-reconstruction optimization framework highly suitable for low-field MRI, which can enable faster, higher-quality imaging with low-cost systems and contribute to advancing affordable and accessible healthcare.
☆ Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
comment: 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models
☆ FedOUI: OUI-Guided Client Weighting for Federated Aggregation
Federated learning usually aggregates client updates using dataset size or gradient-level criteria, while overlooking internal signals about how each client model is organizing its input space during training. We introduce FedOUI, a simple aggregation rule based on the Overfitting-Underfitting Indicator (OUI), an activation-based and label-free metric. Each participating client sends its local update together with a OUI value computed on a fixed probe batch, and the server estimates the round-wise OUI distribution to assign lower weights to structurally atypical clients through a smooth reweighting rule. We evaluate FedOUI on CIFAR-10 under strong non-IID partitioning and noisy-client conditions, comparing it with FedAvg, FedProx, and a gradient-alignment baseline. The clearest gains appear under strong heterogeneity, where OUI-based weighting improves aggregation quality while remaining lightweight and interpretable. These results show that internal activation structure can provide useful information for federated aggregation beyond client size and gradient geometry.
☆ OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training
Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting--Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor--critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.
♻ ☆ Smoothed Analysis of Learning from Positive Samples STOC
Binary classification from positive-only samples is a variant of PAC learning where the learner receives i.i.d. positive samples and aims to learn a classifier with low error. Previous work by Natarajan, Gereb-Graus, and Shvaytser characterized learnability and revealed a largely negative picture: almost no interesting classes, including two-dimensional halfspaces, are learnable. This poses a challenge for applications from bioinformatics to ecology, where practitioners rely on heuristics. In this work, we initiate a smoothed analysis of positive-only learning. We assume samples from a reference distribution $D$ such that the true distribution $D^*$ is smooth with respect to it. In stark contrast to the worst-case setting, we show that all VC classes become learnable in the smoothed model, requiring $O(VC/ε^2)$ positive samples for $ε$ classification error. We also give an efficient algorithm for any class admitting $\mathrm{poly}(ε)$-approximation by degree-$k$ polynomials whose range is lower-bounded by a constant with respect to $D$ in L1-norm. It runs in time $\mathrm{poly}(d^k/ε)$, qualitatively matching L1-regression. Our results also imply faster or more general algorithms for: (1) estimation with unknown-truncation, giving the first polynomial-time algorithm for estimating exponential-family parameters from samples truncated to an unknown set approximable by non-negative polynomials in L1 norm, improving on [KTZ FOCS19; LMZ FOCS24], who required strong L2-approximation; (2) truncation detection for broad classes, including non-product distributions, improving on [DLNS STOC24]'s who required product distributions; and (3) learning from a list of reference distributions, where samples come from $O(1)$ distributions, one of which witnesses smoothness of $D^*$, as arises when list-decoding algorithms learn samplers for $D^*$ from corrupted data.
comment: Accepted for presentation at the 58th ACM Symposium on Theory of Computing (STOC), 2026; abstract shortened for arXiv
♻ ☆ Dispatch-Aware Ragged Attention for Pruned Vision Transformers
Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet standard variable-length attention APIs -- including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA -- fail to translate these savings into proportional wall-clock gains at the short post-pruning sequence lengths typical of ViTs ($\leq$197 tokens). We identify a dispatch-overhead bottleneck: at these lengths, host-side kernel dispatch consumes ${\sim}$50\,$μ$s regardless of workload, exceeding the actual GPU compute time at moderate-to-high pruning rates. We present a lightweight bidirectional Triton attention kernel whose dispatch floor is ${\sim}$24\,$μ$s -- roughly 2.17$\times$ lower than FlashAttention-2 varlen -- allowing pruning savings to become visible in wall-clock time. Integrated into a complete pack-attend-unpack pipeline and evaluated on an NVIDIA RTX 4000 Ada Generation GPU, our system achieves 1.88$\times$ end-to-end throughput over padded PyTorch SDPA at standard 224$\times$224 inputs, scaling to 2.51$\times$ at 384$\times$384. Against FlashAttention-2 varlen -- the strongest baseline -- our kernel delivers 9-12\% higher throughput at serving batch sizes (BS=1-4), and 2.17$\times$ lower kernel latency at 80\% token pruning. Numerical correctness is verified with max absolute logit difference $<$0.004 and bit-exact top-1 predictions.
♻ ☆ Stochastic tensor space feature theory with applications to robust machine learning
In this paper we develop a Multilevel Orthogonal Subspace (MOS) Karhunen-Loeve feature theory based on stochastic tensor spaces, for the construction of robust machine learning features. Training data are treated as instances of a random field within a relevant Bochner space. Our key observation is that separate machine learning classes can reside predominantly in mostly distinct subspaces. Using the Karhunen-Loeve expansion and a hierarchical expansion of the first (nominal) class, a MOS is constructed to detect anomalous signal components, treating the second class as an outlier of the first. The projection coefficients of the input data into these subspaces are then used to train a Machine Learning (ML) classifier. These coefficients become new features from which much clearer separation surfaces can arise for the underlying classes. Tests in the blood plasma dataset (Alzheimer's Disease Neuroimaging Initiative) show dramatic increases in accuracy. This contrast to popular ML methods such as Gradient Boosting, RUS Boost, Random Forest and Neural Networks. We show that with a non-invasive blood test, high-accuracy results can be obtained for predicting AD stages such as cognitive normal, mild cognitive impairment and dementia.
♻ ☆ Local and Mixing-Based Algorithms for Gaussian Graphical Model Selection from Glauber Dynamics ICML 2026
Gaussian graphical model selection is usually studied under independent sampling, but in many applications observations arise from dependent dynamics. We study structure learning when the data consist of a single trajectory of Gaussian Glauber dynamics. We develop two complementary approaches. The first is a local edge-testing estimator based on an appropriately designed correlation test that reveals edges. This estimator does not require waiting for the chain to mix and admits an embarrassingly parallel edgewise implementation. The second is a burn-in/thinning reduction: under a Dobrushin contraction condition, we prove that a suitably subsampled Gaussian Gibbs trajectory is close in total variation to an i.i.d. product sample, allowing standard i.i.d. Gaussian graphical model learners to be used as black boxes. The key technical ingredient, which may be of independent interest, is a high-dimensional total-variation bound for random-scan Gaussian Gibbs samplers, obtained by combining Wasserstein contraction with an approximate Lipschitz smoothing argument. We prove finite-sample recovery guarantees for both approaches, establish information-theoretic lower bounds on the observation time, and empirically compare the resulting sample-computation tradeoffs.
comment: Major revision. Corrects the earlier local ratio-estimator analysis by replacing it with a local product estimator; adds a burn-in/thinning estimator based on total-variation decoupling for Gaussian Gibbs samplers; strengthens the lower bounds; adds experiments; and compares with the related ICML 2026 work of Shen, Wu, Majid, and Moitra
♻ ☆ Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: GPT-5--0.93, Llama 3.3 70B Instruct--0.76) and strong temporal ordering (concordance: GPT-5--0.965, Llama 3.3 70B Instruct--0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
comment: Conference on Health, Inference, and Learning (CHIL 2026)
♻ ☆ Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap
Decades of research in machine learning have given us powerful tools for making accurate predictions. But when used in social settings and on human inputs, better accuracy does not immediately translate to better social outcomes. To effectively promote social well-being through machine learning, this position article advocates for the wide adoption of \emph{social welfare} as a guiding principle. The field of welfare economics asks: how should we allocate limited resources to self-interested agents in a way that maximizes social benefit? We argue that this perspective applies to many modern applications of machine learning in social contexts. As such, we propose that welfare serves as an additional core criterion in the design, study, and use of learning algorithms, complementing the conventional pillars of optimization, generalization, and expressivity, and as a compass guiding both theory and practice.
♻ ☆ Robust Policy Optimization to Prevent Catastrophic Forgetting
Large language models are commonly trained through multi-stage post-training: first via RLHF, then fine-tuned for other downstream objectives. Yet even small downstream updates can compromise earlier learned behaviors (e.g., safety), exposing a brittleness known as catastrophic forgetting. This suggests standard RLHF objectives do not guarantee robustness to future adaptation. To address it, most prior work designs downstream-time methods to preserve previously learned behaviors. We argue that preventing this requires pre-finetuning robustness: the base policy should avoid brittle high-reward solutions whose reward drops sharply under standard fine-tuning. We propose Fine-tuning Robust Policy Optimization (FRPO), a robust RLHF framework that optimizes reward not only at the current policy, but across a KL-bounded neighborhood of policies reachable by downstream adaptation. The key idea is to ensure reward stability under policy shifts via a max-min formulation. By modifying GRPO, we develop an algorithm with no extra computation, and empirically show it substantially reduces safety degradation across multiple base models and downstream fine-tuning regimes (SFT and RL) while preserving downstream task performance. We further study a math-focused RL setting, demonstrating that FRPO preserves accuracy under subsequent fine-tuning.
♻ ☆ From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways
This paper presents a reproducible and process-aware pipeline for predictive monitoring of clinical pathways. The approach integrates data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling to support continuous reasoning on partially observed patient trajectories, overcoming the limitations of traditional retrospective process mining. The framework is evaluated on COVID-19 clinical pathways using ICU admission as the prediction target, considering 4,479 patient cases and 46,804 prefixes. Predictive models are trained and evaluated using a case-level split, with 896 patients in the test set. Logistic Regression achieves the best performance (AUC 0.906, F1-score 0.835). A detailed prefix-based analysis shows that predictive performance improves progressively as new clinical events become available, with AUC increasing from 0.642 at early stages to 0.942 at later stages of the pathway. The results highlight two key findings: predictive signals emerge progressively along clinical pathways, and process-aware representations enable effective early risk estimation from evolving patient trajectories. Overall, the findings suggest that predictive monitoring in healthcare is best conceived as a continuous, dynamically aware process, in which risk estimates are progressively refined as the patient journey evolves.
♻ ☆ On periodic distributed representations using Fourier embeddings
Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, result in difficulty processing and distinguishing nearby angles, especially when their absolute difference exceeds pi. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot product similarities, allowing us to construct a variety of different kernel shapes. In this work, we aim of highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally-plausible representation scheme of Spatial Semantic Pointers.
♻ ☆ Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy
Surrogate models are often used as computationally efficient approximations to complex simulation models, enabling tasks such as solving inverse problems, sensitivity analysis, and probabilistic forward predictions, which would otherwise be computationally infeasible. During training, surrogate parameters are fitted such that the surrogate reproduces the simulation model's outputs as closely as possible. However, the simulation model itself is merely a simplification of the real-world system, often missing relevant processes or suffering from misspecifications e.g., in inputs or boundary conditions. Hints about these might be captured in real-world measurement data, and yet, we typically ignore those hints during surrogate building. In this paper, we propose two novel probabilistic approaches to integrate simulation data and real-world measurement data during surrogate training. The first method trains separate surrogate models for each data source and combines their predictive distributions, while the second incorporates both data sources by training a single surrogate. Both hybrid modeling approaches employ a novel weighting strategy for combining heterogeneous data sources during surrogate training, which operates independently of the chosen surrogate family. We show the conceptual differences and benefits of the two approaches through both synthetic and real-world case studies. The results demonstrate the potential of these methods to improve predictive accuracy, predictive coverage, and to diagnose problems in the underlying simulation model. These insights can improve system understanding and future model development.
♻ ☆ When Does Non-Uniform Replay Matter in Reinforcement Learning?
Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
♻ ☆ Sparsity-Constraint Optimization via Splicing Iteration
Sparsity-constrained optimization underlies many problems in signal processing, statistics, and machine learning. State-of-the-art hard-thresholding (HT) algorithms rely on an appropriately selected continuous step-size parameter to ensure convergence. In this paper, we propose a naturally convergent iterative algorithm, SCOPE (Sparsity-Constrained Optimization via sPlicing itEration). The algorithm is capable of optimizing nonlinear differentiable objective functions that are strongly convex and smooth on low-dimensional subspaces. SCOPE replaces the gradient step with a splicing operation guided directly by the objective value, thereby eliminating the need to tune any continuous hyperparameter. Theoretically, it achieves a linear convergence rate and recovers the true support set when the sparsity level is correctly specified. We also establish parallel theoretical results without relying on restricted-isometry-property-type conditions. We apply SCOPE's versatility and power to solve sparse quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables. With our C++ implementation of SCOPE, numerical experiments on these tasks show that it achieves superior support recovery performance, confirming both its algorithmic efficiency and theoretical guarantees.
comment: 35 pages
♻ ☆ Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges ICML 2026
Modality translation is inherently under-constrained, as multiple cross-modal mappings may yield the same marginals. Recent work has shown that diffusion bridges are effective for this task. However, most existing approaches rely on fully paired datasets, thereby imposing a single data-driven constraint. We propose a diffusion-bridge framework that characterizes the space of admissible solutions and restricts it via alignment constraints, treating paired supervision as an optional heuristic rather than a prerequisite. We validate our method on synthetic and real modality translation benchmarks across unpaired, semi-paired, and paired regimes, showing consistent performance across supervision levels. Notably, \textbf{it achieves near fully-paired quality with a substantial relaxation in pairing requirements, and remaining applicable in the unpaired regime}. These results highlight diffusion bridges as a flexible foundation for modality translation beyond fully paired data.
comment: Accepted to ICML 2026
♻ ☆ Shapley Value Approximation Based on k-Additive Games
The Shapley value is the prevalent solution for fair division problems in which a payout is to be divided among multiple agents. By adopting a game-theoretic view, the idea of fair division and the Shapley value can also be used in machine learning to quantify the individual contribution of features or data points to the performance of a predictive model. Despite its popularity and axiomatic justification, the Shapley value suffers from a computational complexity that scales exponentially with the number of entities involved, and hence requires approximation methods for its reliable estimation. We propose SVA$k_{\text{ADD}}$, a novel approximation method that fits a $k$-additive surrogate game. By taking advantage of $k$-additivity, we are able to elicit the exact Shapley values of the surrogate game and then use these values as estimates for the original fair division problem. The efficacy of our method is evaluated empirically and compared to competing methods.
♻ ☆ A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
Emergent intelligence have played a major role in the modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function E(N, P, K), dependent on data size N, model size P and training steps K, to quantify intelligence behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove that the necessary and sufficient conditions for existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operator and covering number. Theoretical results show that: 1) emergent intelligence is governed by three key factors-training steps, data size and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings. 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.
comment: There exist some typos and inaccurate expression in this version
♻ ☆ Multi-modal Bayesian Neural Network Surrogates with Conjugate Last-Layer Estimation
As data collection and simulation capabilities advance, multi-modal learning, the task of learning from multiple modalities and sources of data, is becoming an increasingly important area of research. Surrogate models that learn from data of multiple auxiliary modalities to support the modeling of a highly expensive quantity of interest have the potential to aid outer loop applications such as optimization, inverse problems, or sensitivity analyses when multi-modal data are available. We develop two multi-modal Bayesian neural network surrogate models and leverage conditionally conjugate distributions in the last layer to estimate model parameters using stochastic variational inference (SVI). We provide a method to perform this conjugate SVI estimation in the presence of partially missing observations. We demonstrate improved prediction accuracy and uncertainty quantification compared to uni-modal surrogate models for both scalar and time series data.
comment: 47 pages including references and appendix, 9 figures
♻ ☆ DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion
Achieving versatile humanoid locomotion with a single policy presents a critical scalability challenge. Prevailing methods often rely on distilling multiple terrain-specific teacher policies into a unified student policy. However, while such distillation captures basic locomotion primitives, it struggles to organically compose these skills to adapt to complex environments, resulting in poor generalization to novel composite terrains unseen during training. To overcome this, we present DreamPolicy, a unified framework that integrates offline data with a diffusion-based world model, enabling a single policy to master both known and unseen terrains. Central to our approach is a terrain-aware world model, driven by an autoregressive diffusion world model trained on aggregated rollouts from specialized policies. This model synthesizes physically plausible future trajectories, which serve as dynamic objectives for a conditioned policy, thereby bypassing manual reward engineering. Unlike distillation, our world model captures generalizable locomotion skills, allowing for robust zero-shot transfer to unseen composite terrains. DreamPolicy naturally scales with data availability. As the offline dataset expands, the diffusion world model continuously acquires richer skills. Experiments demonstrate that DreamPolicy outperforms the strongest baseline by up to 27\% on unseen terrains and 38\% on combined terrains. By unifying world model-based planning and policy learning, DreamPolicy breaks the "one task, one policy" bottleneck and establishes a scalable, data-driven paradigm for generalist humanoid control.
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ Tackling Fake Forgetting through Uncertainty Quantification
Machine unlearning seeks to remove the influence of specified data from a trained model. While the unlearning accuracy provides a widely used metric for assessing unlearning performance, it falls short in assessing the reliability of forgetting. In this paper, we find that the forgetting data points misclassified by unlearning accuracy still have their ground truth labels included in the conformal prediction set from the uncertainty quantification perspective, leading to a phenomenon we term fake forgetting. To address this issue, we propose a novel metric CR, inspired by conformal prediction, that offers a more reliable assessment of forgetting quality. Building on these insights, we further propose an unlearning framework CPU that incorporates conformal prediction into the Carlini & Wagner adversarial attack loss, enabling the ground truth label to be effectively removed from the conformal prediction set. Through extensive experiments on image classification tasks, we demonstrate both the effectiveness of our proposed metric and the superior forgetting quality achieved by our framework. Code is available at https://github.com/TIML-Group/Conformal-Prediction-Unlearning.
♻ ☆ Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.
♻ ☆ Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose \textbf{S}elf-\textbf{Co}nsolidating \textbf{L}anguage Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
comment: 9 pages
♻ ☆ Fractal Graph Contrastive Learning
Graph Contrastive Learning (GCL) relies on semantically consistent graph augmentations, but common local perturbations provide limited control over global structural consistency, motivating a more principled global augmentation strategy. We therefore propose Fractal Graph Contrastive Learning (FractalGCL), a theory-motivated framework that constructs a renormalisation-based augmented graph and introduces a fractal-dimension-aware contrastive loss that penalises unreliable positive views and reweights negative-pair repulsion by finite-scale box-counting discrepancies. However, computing these discrepancies introduces substantial overhead, so we derive and justify a Gaussian surrogate that avoids repeated box-counting on renormalised graphs, yielding about a $61\%$ runtime reduction. Experiments show that FractalGCL serves as an effective frozen-pretraining tool on MalNet-Tiny, achieves strong performance on the standard TUDataset benchmarks, and outperforms the next-best method on real-world urban traffic tasks by $4.51$ percentage points in average accuracy. Code is available at https://anonymous.4open.science/r/FractalGCL-0511/.
comment: 32 pages, 7 figures
♻ ☆ Efficient Bayesian Inference from Noisy Pairwise Comparisons
Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ provides efficient inference, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
♻ ☆ From Observations to States: Latent Time Series Forecasting ICML 2026
Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this to the dominant observation-space forecasting paradigm, where minimizing point-wise errors on noisy and partially observed data encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this, we propose Latent Time Series Forecasting (LatentTSF), a paradigm that shifts TSF from observation regression to latent state prediction. LatentTSF employs an AutoEncoder to project each observation into a learned latent state space and performs forecasting entirely in this space, allowing the model to focus on learning structured temporal dynamics. We provide an information-theoretic analysis showing that the latent objectives can be motivated as surrogates for maximizing mutual information between predicted and ground-truth latent states and future observations. Extensive experiments on widely-used benchmarks confirm that LatentTSF effectively mitigates latent chaos, yielding consistent improvements in both forecasting accuracy and representation quality. Our code is available at https://github.com/Muyiiiii/LatentTSF.
comment: Accepted at ICML 2026
♻ ☆ GIST: Gauge-Invariant Spectral Transformers for Scalable Graph Neural Operators
Neural operators on irregular meshes face a fundamental tension. Spectral positional encodings, the natural choice for capturing geometry, require cubic-complexity eigendecomposition and inadvertently break gauge invariance through numerical solver artifacts; existing efficient approximations sacrifice gauge symmetry by design. Both failure modes break discretization invariance: models fail to transfer across mesh resolutions of the same domain, and similarly across different graphs of related structure in inductive settings. We propose GIST (Gauge-Invariant Spectral Transformer), a scalable neural operator that resolves this tension by restricting attention to pairwise inner products of efficient approximate spectral embeddings. We prove these inner products estimate an exactly gauge-invariant graph kernel at end-to-end $\mathcal{O}(N)$ complexity, and establish a formal connection between gauge invariance and discretization-invariant learning with bounded mismatch error. To our knowledge, GIST is the first scalable graph neural operator with a provable discretization-mismatch bound. Empirically, GIST sets state-of-the-art on the AirfRANS, ShapeNet-Car, DrivAerNet, and DrivAerNet++ mesh benchmarks (up to 750K nodes), and additionally matches strong baselines on standard graph benchmarks (e.g., 99.50% micro-F1 on PPI).
♻ ☆ Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG) ACL 2026
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding them in external knowledge. However, its application in sensitive domains is limited by privacy risks. Existing private RAG methods typically rely on query-time differential privacy (DP), which requires repeated noise injection and leads to accumulated privacy loss. To address this issue, we propose DP-SynRAG, a framework that uses LLMs to generate differentially private synthetic RAG databases. Unlike prior methods, the synthetic text can be reused once created, thereby avoiding repeated noise injection and additional privacy costs. To preserve essential information for downstream RAG tasks, DP-SynRAG extends private prediction, which instructs LLMs to generate text that mimics subsampled database records in a DP manner. Experiments show that DP-SynRAG achieves superior performance to the state-of-the-art private RAG systems while maintaining a fixed privacy budget, offering a scalable solution for privacy-preserving RAG.
comment: Accepted to ACL 2026 Findings
♻ ☆ Taking the Road Less Scheduled with Adaptive Polyak Steps
Schedule-Free SGD, proposed in [Defazio et al., 2024], achieves optimal convergence rates without requiring the training horizon in advance, by replacing learning rate schedules with a principled form of iterate averaging. However, the method still requires tuning a base learning rate whose optimal value depends on unknown problem constants. In this work, we continue down this road by deriving Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. We first propose an oracle variant that uses per-sample optimal function values and prove an $O(1/\sqrt{t})$ anytime last-iterate rate for convex Lipschitz objectives. We then remove the oracle requirement with a safeguarded variant that replaces the unknown optimal values with any available lower bound, achieving the same rate up to a neighborhood that vanishes under interpolation. Both step sizes reduce to existing Polyak rules for standard SGD when momentum is set to zero, unifying standard and schedule-free Polyak methods. Numerical experiments on language modeling, including pretraining and distillation, show that the proposed methods match or surpass tuned Schedule-Free baselines while offering greater robustness to hyperparameter choices.
♻ ☆ Robustness Certificates for Neural Networks against Adversarial Attacks
The increasing use of machine learning in safety-critical domains amplifies the risk of adversarial threats, especially data poisoning attacks that corrupt training data to degrade performance or induce unsafe behavior. Most existing defenses lack formal guarantees or rely on restrictive assumptions about the model class, attack type, extent of poisoning, or point-wise certification, limiting their practical reliability. This paper introduces a principled formal robustness certification framework that models gradient-based training as a discrete-time dynamical system (dt-DS) and formulates poisoning robustness as a formal safety verification problem. By adapting the concept of barrier certificates (BCs) from control theory, we introduce sufficient conditions to certify a robust radius ensuring that the terminal model remains safe under worst-case ${\ell}_p$-norm based poisoning. To make this practical, we parameterize BCs as neural networks trained on finite sets of poisoned trajectories. We further derive probably approximately correct (PAC) bounds by solving a scenario convex program (SCP), which yields a confidence lower bound on the certified robustness radius generalizing beyond the training set. Importantly, our framework also extends to certification against test-time attacks, making it the first unified framework to provide formal guarantees in both training and test-time attack settings. Experiments on MNIST, SVHN, and CIFAR-10 show that our approach certifies non-trivial perturbation budgets while being model-agnostic and requiring no prior knowledge of the attack or contamination level.
♻ ☆ Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
Robot learning requires adaptation methods that improve reliably from limited, mixed-quality interaction data. This is especially challenging in long-horizon, contact-rich tasks, where end-to-end policy finetuning remains inefficient and brittle. World models offer a compelling alternative: by predicting the outcomes of candidate action sequences, they enable online planning through counterfactual reasoning. However, training action-conditioned robotic world models directly in the real world requires diverse data at impractical scale. We introduce Simulation Distillation (SimDist), a framework that uses physics simulators as a scalable source of action-conditioned robot experience. During pretraining, SimDist distills structural priors from the simulator into a world model that enables planning from raw real-world observations. During real-world adaptation, SimDist transfers the encoder, reward model, and value function learned in simulation, and updates only the latent dynamics model using real-world prediction losses. This reduces adaptation to supervised system identification while preserving dense, long-horizon planning signals for online improvement. Across contact-rich manipulation and quadruped locomotion tasks, SimDist rapidly improves with experience, while prior adaptation methods struggle to make progress or degrade during online finetuning. Project website and code: https://sim-dist.github.io
comment: Robotics: Science and Systems 2026
♻ ☆ Trajectory First: A Curriculum for Discovering Diverse Policies
Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has become a useful reinforcement learning (RL) framework for training a set of diverse agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robot manipulation, resulting in limited behavioral diversity. We address this with a two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies in a second stage. In our empirical evaluation, we provide novel insights into challenges of diversity-targeted training and show that our curriculum increases the diversity of learned skills while maintaining high task performance.
comment: Accepted into the Inductive Biases in Reinforcement Learning Workshop at RLC 2025
♻ ☆ Approximating Simple ReLU Networks based on Spectral Decomposition of Fisher Information
Properties of Fisher information matrices of 2-layer neural ReLU networks with random hidden weights are studied. For these networks, it is known that the eigenvalue distribution highly concentrates on several eigenspaces approximately. In particular, the eigenvalues for the first three eigenspaces account for 97.7% of the trace of the Fisher information matrix, independently of the number of parameters. In this paper, we identify the function spaces which correspond to those major eigenspaces. This function space consists of the spherical harmonic functions whose orders are not greater than 2. This result relates to the Mercer decomposition of the neural tangent kernels.
comment: 18 pages, 1 figure, 1 table
♻ ☆ Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation
Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the inherent variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation by treating performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, empirical distributions of performance metrics are analyzed using quantiles and corresponding confidence intervals. The study investigates point and interval estimation of quantiles based on real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting practical constraints in machine learning, where repeated training is computationally expensive. The results show that meaningful statistical inference on the underlying performance distribution is feasible even with sample sizes in the range of 10-25, while standard nonparametric confidence interval remain applicable under these conditions. The proposed approach provides a more detailed characterization of variability and uncertainty compared to mean-based evaluation and enables a more differentiated comparison of models. In particular, it supports a risk-oriented interpretation of model performance, which is relevant in applications where reliability is critical. The presented methods are easy to implement and broadly applicable, making them a practical extension to standard performance evaluation procedures in machine learning.
comment: 21 pages, 9 figures
♻ ☆ Pruning Federated Models through Loss Landscape Analysis and Client Agreement Scoring
The practical deployment of Federated Learning (FL) on resource-constrained devices is fundamentally limited by the high cost of training large models and the instability caused by heterogeneous (non-IID) client data. Conventional pruning methods often treat data heterogeneity as a problem to be mitigated. In this work, we introduce a paradigm shift: we reframe client diversity as a feature to be harnessed. We propose AutoFLIP, a framework that begins not with training, but with a one-time federated loss exploration. During this phase, clients collaboratively build a map of the collective loss landscape, using their diverse data to reveal the problem's essential structure. This shared intelligence then guides an adaptive pruning strategy that is dynamically refined by client agreement throughout training. This approach allows AutoFLIP to identify robust and efficient sub-networks from the outset. Our extensive experiments show that AutoFLIP reduces computational overhead by an average of 52% and communication costs by over 65% while simultaneously achieving state-of-the-art accuracy in challenging non-IID settings.
♻ ☆ Integral Imprecise Probability Metrics
Quantifying differences between probability distributions is fundamental to statistics and machine learning, primarily for comparing statistical uncertainty. In contrast, epistemic uncertainty -- due to incomplete knowledge -- requires richer representations than those offered by classical probability. Imprecise probability (IP) theory offers such models, capturing ambiguity and partial belief. This has driven growing interest in imprecise probabilistic machine learning (IPML), where inference and decision-making rely on broader uncertainty models -- highlighting the need for metrics beyond classical probability. This work introduces the integral imprecise probability metric framework, a Choquet integral-based generalisation of classical integral probability metrics to the setting of capacities -- a broad class of IP models encompassing many existing ones, including lower probabilities, probability intervals, belief functions, and more. Theoretically, we establish conditions under which IIPM serves as a valid metric and metrises a form of weak convergence of capacities. Practically, IIPM not only enables comparison across different IP models but also supports the quantification of epistemic uncertainty~(EU) within a single IP model. In particular, by comparing an IP model with its conjugate, IIPM gives rise to a new class of epistemic uncertainty measures -- Maximum Mean Imprecision -- which satisfy key axiomatic properties proposed in the uncertainty quantification literature. We validate MMI through selective classification experiments, demonstrating strong empirical performance against established EU measures, and outperforming them when classical methods struggle to scale to a large number of classes. Our work advances both theory and practice in Imprecise Probabilistic Machine Learning, offering a principled framework for comparing and quantifying epistemic uncertainty under imprecision.
comment: 48 pages
♻ ☆ Rotary Masked Autoencoders are Versatile Learners NeurIPS 2025
Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.
comment: NeurIPS 2025 Final Camera Ready
♻ ☆ PRIME: Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies
Proteins are inherently multiscale physical systems whose functional properties emerge from coordinated structural organization across multiple spatial resolutions, ranging from atomic interactions to global fold topology. However, existing protein representation learning methods typically operate at a single structural level or treat different sources of structural information as parallel modalities, without explicitly modeling their hierarchical relationships. We introduce PRIME (Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies), a unified framework that models proteins as a nested family of five physically grounded structural graphs spanning surface, atomic, residue, secondary-structure, and protein levels. Adjacent levels are connected through deterministic, physics-informed assignment operators, enabling bidirectional information exchange via bottom-up aggregation and top-down contextual refinement. Experiments on standard protein representation learning benchmarks demonstrate strong and competitive performance across diverse tasks, with particularly notable gains on the Fold Classification benchmark, where PRIME outperforms the strongest geometric GNN baseline by margins of 13.80 and 18.30 points on the harder Superfamily and Fold splits, and achieves a state-of-the-art accuracy of 84.10\% on Reaction Class prediction, surpassing all baseline methods, including ESM. Ablation studies confirm that each structural level contributes complementary and non-redundant information, and adaptive cross-attention analysis reveals that PRIME autonomously identifies the most task-relevant structural resolutions at prediction time. Our source code is publicly available at https://github.com/HySonLab/PRIME
♻ ☆ Doubly Outlier-Robust Online Infinite Hidden Markov Model ICML 2026
We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.
comment: 43rd International Conference on Machine Learning (ICML 2026)
♻ ☆ Feedback Lunch: Learned Feedback Codes for Secure Communications
We consider reversely-degraded secure-communication channels, for which the secrecy capacity is zero if there is no channel feedback. Specifically, we focus on a seeded modular code design for the block-fading Gaussian wiretap channel with channel-output feedback, combining universal hash functions for security and learned feedback-based codes for reliability. The trade-off between communication reliability and information leakage is studied, illustrating that feedback enables agreeing on a secret key shared between legitimate parties, overcoming the security advantage of the eavesdropper. Our findings motivate code designs for sensing-assisted secure communications in the context of integrated sensing and communication (ISAC).
comment: Accepted to WiseML'26
♻ ☆ Oscillators Are All You Need: Irregular Time Series Modelling via Damped Harmonic Oscillators with Closed-Form Solutions
Transformers excel at time series modelling through attention mechanisms that capture long-term temporal patterns. However, they assume uniform time intervals and therefore struggle with irregular time series. Neural Ordinary Differential Equations (NODEs) effectively handle irregular time series by modelling hidden states as continuously evolving trajectories. ContiFormers arxiv:2402.10635 combine NODEs with Transformers, but inherit the computational bottleneck of the former by using heavy numerical solvers. This bottleneck can be removed by using a closed-form solution for the given dynamical system - but this is known to be intractable in general! We obviate this by replacing NODEs with a novel linear damped harmonic oscillator analogy - which has a known closed-form solution. We model keys and values as damped, driven oscillators and expand the query in a sinusoidal basis up to a suitable number of modes. This analogy naturally captures the query-key coupling that is fundamental to any transformer architecture by modelling attention as a resonance phenomenon. Our closed-form solution eliminates the computational overhead of numerical ODE solvers while preserving expressivity. We prove that this oscillator-based parameterisation maintains the universal approximation property of continuous-time attention; specifically, any discrete attention matrix realisable by ContiFormer's continuous keys can be approximated arbitrarily well by our fixed oscillator modes. Our approach delivers both theoretical guarantees and scalability, achieving state-of-the-art performance on irregular time series benchmarks while being orders of magnitude faster. Acknowledgement: This work was done in collaboration with Dirac Labs.
♻ ☆ Efficient Remote KV Cache Reuse with GPU-native Video Codec
Remote KV cache reuse fetches KV cache for identical contexts from remote storage, avoiding recomputation, accelerating LLM inference. While it excels in high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the KV reuse benefits. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVCodec, enables effective KV cache coding with two techniques. The codec-friendly tensor layout compresses the KV cache in a highly compact video format, enabling fast transmission. The efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in an efficient pipelined manner, eliminating resource contention, masking network fluctuations, and achieving minimum time-to-first-token (TTFT). We prototype KVCodec on diverse GPUs from high- to low-end. Experiments reveal that it reduces TTFT by up to 3.51 times while maintaining lossless accuracy, compared to SOTA methods.
comment: Accepted by SIGCOMM 2026
♻ ☆ Physics Aware Neural Networks: Denoising for Magnetic Navigation
Magnetic-anomaly navigation, leveraging small-scale variations in the Earth's magnetic field, is a promising alternative when GPS is unavailable or compromised. Airborne systems face a key challenge in extracting geomagnetic field data: the aircraft itself induces magnetic noise. Although the classical Tolles-Lawson model addresses this, it inadequately handles stochastically corrupted magnetic data required for navigation. To handle stochastic noise, we propose using two physics-based constraints: divergence-free vector fields and E(3)-equivariance. These ensure the learned magnetic field obeys Maxwell's equation and that outputs transform correctly with sensor position and orientation. The divergence-free constraint is implemented by training a neural network to output a vector potential A, with the magnetic field defined as its curl. For E(3)-equivariance, we use tensor products of geometric tensors represented via spherical harmonics with known rotational transformations. Enforcing physical consistency and restricting the admissible function space acts as an implicit regularizer that improves spatiotemporal performance. We present ablation studies evaluating each constraint alone and jointly across CNNs, MLPs, LTCs, and Contiformers. Continuous-time dynamics and long-term memory are critical for modelling magnetic time series; the Contiformer, which provides both, outperforms existing methods. To mitigate data scarcity, we generate synthetic datasets using the World Magnetic Model (WMM) and time-series conditional GANs, producing realistic, temporally consistent magnetic sequences across varied trajectories and environments. Experiments show that embedding these constraints significantly improves predictive accuracy and physical plausibility, outperforming classical and unconstrained deep learning approaches. Acknowledgement: This work was done in collaboration with Dirac Labs.
♻ ☆ Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback UAI 2025
We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{Ω(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.
comment: Accepted to UAI 2025
♻ ☆ In-Context Multi-Objective Optimization
Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
♻ ☆ DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints are all available at https://github.com/thunlp/DECO.
comment: 14 pages, 11 figures, 11 tables
♻ ☆ Learning Compact Boolean Networks
Floating-point neural networks dominate modern machine learning but incur substantial inference costs, motivating emerging interest in Boolean networks for resource-constrained deployments. Since Boolean networks use only Boolean operations, they can achieve nanosecond-scale inference latency. However, learning Boolean networks that are both compact and accurate remains challenging because of their discrete, combinatorial structure. In this work we address this challenge via three novel, complementary contributions: (i) a new parameter-free strategy for learning effective connections, (ii) a novel compact convolutional Boolean architecture that exploits spatial locality while requiring fewer Boolean operations than existing convolutional kernels, and (iii) an adaptive discretization procedure that reduces the accuracy drop incurred when converting a continuously relaxed network into a discrete Boolean network. Across standard vision benchmarks, our method improves the Pareto frontier over prior state-of-the-art methods, achieving higher accuracy with up to $47\times$ fewer Boolean operations. This advantage also extends to other modalities. Further, on an FPGA, our model on MNIST achieves 99.38\% accuracy with 6.48 ns latency, surpassing the prior state-of-the-art in both accuracy and runtime, while generating a $7\times$ smaller circuit. Code and models are available at https://github.com/eth-sri/CompactLogic.
♻ ☆ BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking ICML 2026
Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95\% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.
comment: Accepted to ICML 2026
♻ ☆ Explicit integral representations and quantitative bounds for two-layer ReLU networks
An approach to construct explicit integral representations for two-layer ReLU networks is presented, which provides relatively simple representations for any multivariate polynomial. Quantitative bounds are provided for a particular, sharpened ReLU integral representation, which involves a harmonic extension and a projection. The bounds demonstrate that functions can be approximated with $L^{2}(\mathcal{D})$ errors that do not depend explicitly on dimension or degree, but rather the coefficients of their monomial expansions and the distribution $\mathcal{D}$. We also present a connection to the RKHS of the exponential kernel $K(x,y)=\exp\left(\left\langle x,y\right\rangle \right)$, and a very simple integral representation involving additionally multiplication via a fixed function which has better quantitative bounds.
♻ ☆ A Survey of On-Policy Distillation for Large Language Models
As Large Language Models (LLMs) continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become a central engineering problem, and knowledge distillation remains the dominant technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but must generate its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation (OPD) reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as $f$-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained RL. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agentic distillation, and the growing overlap between knowledge distillation and RL.
comment: Ongoing Work
♻ ☆ FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.
♻ ☆ POP: Prior-Fitted First-Order Optimization Policies
Gradient-based optimizers are highly sensitive to design choices in their adaptive learning rate mechanisms. To address this limitation, we introduce POP, a meta-learned Reinforcement Learning (RL) policy that predicts adaptive learning rates for gradient descent, conditioned on the contextual information provided in the optimization trajectory. Our method introduces a novel RL reward formulation, a new function-scaling strategy for in-distribution generalization, and a novel prior that is used to sample millions of synthetic optimization problems. We evaluate POP on an established benchmark including 43 optimization functions of various complexity, where it significantly outperforms gradient-based methods. Our evaluation demonstrates strong generalization capabilities without task-specific tuning.
comment: Under Review
♻ ☆ Improving the Accuracy of Amortized Model Comparison with Self-Consistency NeurIPS 2025
Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the accuracy of neural surrogates deteriorates when simulation models are misspecified; the very case where model comparison is most needed. We evaluate four different amortized BMC methods. We supplement traditional simulation-based training of these methods with a \emph{self-consistency} (SC) loss on unlabeled real data to improve BMC estimates under distribution shifts. Using one artificial and two real-world case studies, we compare amortized BMC estimators with and without SC against analytic or bridge sampling benchmarks. In the \emph{closed-world} case (data is generated by one of the candidate models), BMC estimators using classifiers work acceptably well even without SC training. However, these methods also benefit the least from SC training. In the \emph{open-world} scenario (all models misspecified), SC training strongly improves BMC estimators when having access to analytic likelihoods, or when surrogate likelihoods are locally accurate near the true parameter posterior, even for severely misspecified models. We conclude with practical recommendations for amortized BMC and suggestions for future research.
comment: 22 pages, 14 figures. This version extends our initial results presented at Reliable ML from Unreliable Data Workshop at NeurIPS 2025. Previously, this version appeared as arXiv:2512.14308v2, which has now been withdrawn: the two versions share too much content to be considered separate papers
♻ ☆ Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.
♻ ☆ DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series
Modeling uncertainty in heavy-tailed time series remains a critical challenge for deep probabilistic forecasting models, which often struggle to capture abrupt, extreme events. While Lévy stable distributions offer a natural framework for modeling such non-Gaussian behaviors, the intractability of their probability density functions severely limits conventional likelihood-based inference. To address this, we introduce DeepLévy, a neural framework that learns mixtures of Lévy stable distributions by minimizing the discrepancy between empirical and parametric characteristic functions. DeepLévy incorporates a mixture mechanism that adaptively learns context-dependent weights and parameters over multiple Lévy components, enabling flexible multi-horizon uncertainty modeling. Evaluations on both real and synthetic datasets demonstrate that DeepLévy outperforms state-of-the-art deep probabilistic forecasting approaches in tail risk metrics, especially under extreme volatility.
♻ ☆ Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context
We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.
comment: 46 pages, 8 figures
♻ ☆ A Formal Comparison Between Chain of Thought and Latent Thought ICML 2026
Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.
comment: Camera-ready version for ICML 2026
♻ ☆ Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation: An Interventional and Predictive Framework for Grokkin
Grokking - the delayed transition from memorisation to generalisation in neural networks - remains poorly understood. We study this phenomenon through the geometry of learned representations and identify a consistent empirical signature preceding generalisation: collapse of the spectral entropy of the representation covariance matrix. Across modular arithmetic tasks and multiple random seeds, spectral entropy decreases gradually during training and crosses a stable task-specific threshold before test accuracy rises. A representation-mixing intervention that delays this collapse also delays grokking, including under norm-matched controls, indicating that the effect is not explained by parameter norm alone. We further show that the entropy gap predicts the remaining time until grokking with useful out-of-sample accuracy. To probe the structure underlying this transition, we introduce a Fourier-alignment observable for cyclic-group tasks. Entropy collapse is strongly coupled to the emergence of Fourier-aligned representations, suggesting that spectral entropy tracks concentration of the representation into task-structured directions rather than generic compression alone. The same qualitative dynamics appear in non-abelian group composition tasks, while MLP controls show that entropy collapse by itself is insufficient for grokking in the absence of appropriate inductive bias. Taken together, the results support a view of grokking as a representational phase transition with an observable geometric signature. We discuss the scope and limitations of this interpretation, connections to recent feature-learning and spectral-dynamics work, and directions for testing whether similar transitions appear in larger-scale learning systems.
comment: 25 pages, 15 figures, 6 tables
♻ ☆ ANTIC: Adaptive Neural Temporal In-situ Compressor ICML 2026
The persistent storage requirements for high-resolution, spatiotemporally evolving fields governed by large-scale and high-dimensional partial differential equations (PDEs) have reached the petabyte-to-exabyte scale. Transient simulations modeling Navier-Stokes equations, magnetohydrodynamics, plasma physics, or binary black hole mergers generate data volumes that are prohibitive for modern high-performance computing (HPC) infrastructures. To address this bottleneck, we introduce ANTIC (Adaptive Neural Temporal in situ Compressor), an end-to-end in situ compression pipeline. ANTIC consists of an adaptive temporal selector tailored to high-dimensional physics that identifies and filters informative snapshots at simulation time, combined with a spatial neural compression module based on continual fine-tuning that learns residual updates between adjacent snapshots using neural fields. By operating in a single streaming pass, ANTIC enables a combined compression of temporal and spatial components and effectively alleviates the need for explicit on-disk storage of entire time-evolved trajectories. Experimental results demonstrate how storage reductions of several orders of magnitude relate to physics accuracy.
comment: 31 pages, 19 figures, 9 Tables; Accepted at ICML 2026, First authors contributed equally
♻ ☆ DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick
Vector quantization is common in deep models, yet its hard assignments block gradients and hinder end-to-end training. We propose DiVeQ, which treats quantization as adding an error vector that mimics the quantization distortion, keeping the forward pass hard while letting gradients flow. We also present a space-filling variant (SF-DiVeQ) that assigns input to a curve constructed by the lines connecting codewords, resulting in less quantization error and full codebook usage. Both methods train end-to-end without requiring auxiliary losses or temperature schedules. In VQ-VAE image compression, VQGAN image generation, and DAC speech coding tasks across various data sets, our proposed methods improve reconstruction and sample quality over alternative quantization approaches.
♻ ☆ SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis
Deep-learning survival models for electronic health record (EHR) data are hard to compare across papers because the upstream preprocessing step, which includes cohort definition, time discretisation, missingness handling, and censoring rules, is typically undocumented and inconsistent. A reported difference in concordance between two mortality models can therefore reflect any of these choices rather than a modelling contribution. We present SurvBench, an open-source preprocessing pipeline that converts raw PhysioNet exports into model-ready tensors for survival analysis. SurvBench covers four critical-care databases (MIMIC-IV, eICU, MC-MED, HiRID) and four input modalities: time-series vitals and laboratory values, static demographics, International Classification of Diseases (ICD) codes, and radiology report embeddings. Every preprocessing decision is controlled through YAML configuration. Imputation, scaling, and feature filtering are fit on the training fold only. Missingness is recorded as a binary mask alongside each feature tensor. The pipeline handles single-risk endpoints (in-hospital and in-ICU mortality) and competing-risks endpoints (a three-way emergency-department admission pathway, with home discharge treated as administrative censoring). We also provide support for harmonised cross-dataset external validation between eICU and MIMIC-IV. SurvBench is publicly available at https://github.com/munibmesinovic/SurvBench, providing a robust platform that future deep-learning EHR survival work, especially nascent multi-modal approaches, can be measured against under matched preprocessing.
♻ ☆ Maximin Robust Bayesian Experimental Design
We address the brittleness of Bayesian experimental design under model misspecification by formulating the problem as a max--min game between the experimenter and an adversarial nature subject to information-theoretic constraints. We demonstrate that this approach yields a robust objective governed by Sibson's $α$-mutual information (MI), which identifies the $α$-tilted posterior as the robust belief update and establishes the Rényi divergence as the appropriate measure of conditional information gain. To mitigate the bias and variance of nested Monte Carlo estimators needed to estimate Sibson's $α$-MI, we adopt a PAC-Bayes framework to search over stochastic design policies, yielding rigorous high-probability lower bounds on the robust expected information gain that explicitly control finite-sample error.
♻ ☆ Constructive conditional normalizing flows
Motivated by applications in conditional sampling, given a probability measure $μ$ and a diffeomorphism $φ$, we consider the problem of simultaneously approximating $φ$ and the pushforward $φ_{\#}μ$ by means of the flow of a continuity equation whose velocity field is a perceptron neural network with piecewise constant weights. We provide an explicit construction based on a polar-like decomposition of the Lagrange interpolant of $φ$. The latter involves a compressible component, given by the gradient of a particular convex function, which can be realized exactly, and an incompressible component, which -- after approximating via permutations -- can be implemented through shear flows intrinsic to the continuity equation. For more regular maps $φ$ -- such as the Knöthe-Rosenblatt rearrangement -- we provide an alternative, probabilistic construction inspired by the Maurey empirical method, in which the number of discontinuities in the weights doesn't scale inversely with the ambient dimension.
♻ ☆ Strategically Deceptive Model Deployment in Performative Prediction
Machine Learning systems are increasingly deployed in decision-making settings that shape user behavior and, in turn, the data on which future decisions are based. Performative Prediction (PP) formalizes this feedback loop by modeling how deployed models induce distributional shifts. It studies how to learn robust and well-performing models under such dynamics. However, existing PP frameworks typically assume that the model governing these decisions is the same model observed by users (therefore, to which they respond). In practice, deployer institutions may instead disclose curated models, while internally relying on distinct opaque models. We introduce Decoupled Performative Prediction (DPP), a framework that explicitly models mismatches between the model governing institutional decisions and the model that shapes user behavior. By analyzing the resulting optimization landscape, we show that DPP admits new different solutions that provably achieve lower risk for the institution than those under classical PP. We further propose an algorithm with provable convergence guarantees under standard assumptions, demonstrating how easy institutions can benefit from strategically deceptive deployment when they control model disclosure and users lack countervailing power. To capture the implications of such behavior, we introduce the deception cost, a quantitative measure of the degree of deception experienced by users. We study settings in which institutions incorporate this cost into the optimization process, motivated by reputational concerns or potential user abandonment, and show that such self-imposed constraints are insufficient to protect users. Overall, our results demonstrate that model disclosure is not merely an ethical consideration but a core technical design decision, underscoring the need for regulations that hold institutions accountable for deceptive deployment practices.
comment: Accepted to FAccT 2026
♻ ☆ Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
comment: 28 pages, 13 figures
♻ ☆ DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes IEEE
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Reconsidering the energy efficiency of spiking neural networks
Spiking Neural Networks (SNNs) promise higher energy efficiency over conventional Quantized Artificial Neural Networks (QNNs) due to their event-driven, spike-based computation. However, prevailing energy evaluations often oversimplify, focusing on computational aspects while neglecting critical overheads like comprehensive data movement and memory access. Such simplifications can lead to misleading conclusions regarding the true energy benefits of SNNs. This paper presents a rigorous re-evaluation. We establish a fair baseline by mapping rate-encoded SNNs with $T$ timesteps to functionally equivalent QNNs with $\lceil \log_2(T+1) \rceil$ bits. This ensures both models have comparable representational capacities, as well has similar hardware requirement, enabling meaningful energy comparisons. We introduce a detailed analytical energy model encompassing core computation and data movement. Using this model, we systematically explore a wide parameter space, including intrinsic network characteristics ($T$, spike rate $s_r$, QNN sparsity $γ$, model size $N$, weight bit-level) and hardware characteristics (memory system and network-on-chip). Our analysis identifies specific operational regimes where SNNs genuinely offer superior energy efficiency. For example, under typical neuromorphic hardware conditions, SNNs with moderate time windows ($T \in [5,10]$) require an average spike rate ($s_r$) below 6.4\% to outperform equivalent QNNs. Furthermore, to illustrate the real-world implications of our findings, we analyze the operational lifetime of a typical smartwatch, showing that an optimized SNN can nearly double its battery life compared to a QNN. These insights guide the design of turely energy-efficient neural network solutions.
♻ ☆ RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs ICML 2026
Primal heuristics play a crucial role in quickly finding feasible solutions for NP-hard integer linear programming (ILP). Although $\textit{end-to-end learning}$-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILP involving non-binary integers. Empirically, RL-SPH rapidly obtains high-quality feasible solutions with a 100% feasibility rate, achieving on average a 28.6$\times$ lower primal gap and a 2.6$\times$ lower primal integral compared to existing start primal heuristics.
comment: Accepted at ICML 2026. 30 pages, 12 figures, 22 tables
♻ ☆ Semi-Supervised Bayesian GANs with Log-Signatures for Uncertainty-Aware Credit Card Fraud Detection
We present a novel deep generative semi-supervised framework for credit card fraud detection, formulated as time series classification task. As financial transaction data streams grow in scale and complexity, traditional methods often require large labeled datasets, struggle with time series of irregular sampling frequencies and varying sequence lengths. To address these challenges, we extend conditional Generative Adversarial Networks (GANs) for targeted data augmentation, integrate Bayesian inference to obtain predictive distributions and quantify uncertainty, and leverage log-signatures for robust feature encoding of transaction histories. We introduce a novel Wasserstein distance-based loss to align generated and real unlabeled samples while simultaneously maximizing classification accuracy on labeled data. Our approach is evaluated on the BankSim dataset, a widely used simulator for credit card transaction data, under varying proportions of labeled samples, demonstrating consistent improvements over benchmarks in both global statistical and domain-specific metrics. These findings highlight the effectiveness of GAN-driven semi-supervised learning with log-signatures for irregularly sampled time series and emphasize the importance of uncertainty-aware predictions.
♻ ☆ Geometry-Aware Neural Optimizer for Shape Optimization and Inversion ICML2026
Geometry is central to PDE-governed systems, motivating shape optimization and inversion. Classical pipelines conduct costly forward simulation with geometry processing, requiring substantial expert effort. Neural surrogates accelerate forward analysis but do not close the loop because gradients from objectives to geometry are often unavailable. Existing differentiable methods either rely on restrictive parameterizations or unstable latent optimization driven by scalar objectives, limiting interpretability and part-wise control. To address these challenges, we propose Geometry-Aware Neural Optimizer (GANO), an end-to-end differentiable framework that unifies geometry representation, field-level prediction, and automated optimization/inversion in a single latent-space loop. GANO encodes shapes with an auto-decoder and stabilizes latent updates via a denoising mechanism, and a geometry-injected surrogate provides a reliable gradient pathway for geometry updates. Moreover, GANO supports part-wise control through null-space projection and uses remeshing-free projection to accelerate geometry processing. We further prove that denoising induces an implicit Jacobian regularization that reduces decoder sensitivity, yielding controlled deformations. Experiments on three benchmarks spanning 2D Helmholtz, 2D airfoil, and 3D vehicles show state-of-the-art accuracy and stable, controllable updates, achieving up to +55.9% lift-to-drag improvement for airfoils and ~7% drag reduction for vehicles.
comment: To appear in ICML2026
♻ ☆ CRPS-Optimal Binning for Univariate Conformal Regression
We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimisation of within-sample LOO-CRPS turns out to be inappropriate for selecting $K$ as it results in in-sample optimism. We instead select $K$ by $K$-fold cross-validation of test CRPS, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. The conformal prediction is transductive and data-efficient, as all observations are used for both partitioning and p-value calculation, with no need to reserve a hold-out set. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, CQR-QRF, and conformalized isotonic distributional regression), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
♻ ☆ Provably Data-driven Multiple Hyper-parameter Tuning with Structured Loss Function ICML 2026
Data-driven algorithm design automates hyperparameter tuning, but its statistical foundations remain limited because model performance can depend on hyperparameters in implicit and highly non-smooth ways. Existing guarantees focus on the simple case of a one-dimensional (scalar) hyperparameter. This leaves the practically important, multi-dimensional hyperparameter tuning setting unresolved. We address this open question by establishing the first general framework for establishing generalization guarantees for tuning multi-dimensional hyperparameters in data-driven settings. Our approach strengthens the generalization guarantee framework for semi-algebraic function classes by exploiting tools from real algebraic geometry, yielding sharper, more broadly applicable guarantees. For completeness, we also instantiate the first lower bound for this general setting. We further extend the analysis to hyperparameter tuning using the validation loss under minimal assumptions, and derive improved bounds when additional structure is available. Finally, we demonstrate the scope of the framework with new learnability results, including data-driven weighted group lasso and weighted fused lasso.
comment: Accepted to ICML 2026
♻ ☆ Modality-Inconsistent Continual Learning of Multimodal Large Language Models
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL using a total of six tasks and conduct experiments to validate the effectiveness of our MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.
comment: Accepted at Transactions on Machine Learning Research (TMLR), 2026
♻ ☆ An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
comment: 40 pages, 14 figures, 16 tables. Accepted as a conference paper at AHLI Conference on Health, Inference, and Learning (CHIL) 2026
♻ ☆ BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the $\mathcal O(\varepsilon^{-2})$ sample complexity of MA-SOBA for finding an $\varepsilon$-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.
♻ ☆ Stationary MMD Points
Approximation of a target probability distribution using a finite set of points is a problem of fundamental importance in numerical integration. Several authors have proposed to select points by minimising a maximum mean discrepancy (MMD), but the non-convexity of this objective typically precludes global minimisation. Instead, we consider the concept of \emph{stationary points of the MMD} which, in contrast to points globally minimising the MMD, can be accurately computed. Our main contributions are two-fold and theoretical in nature. We first prove the (perhaps surprising) result that, for integrands in the associated reproducing kernel Hilbert space, the numerical integration error of stationary MMD points vanishes \emph{faster} than the MMD. Motivated by this \emph{super-convergence} property, we consider MMD gradient flows as a practical strategy for computing stationary points of the MMD. We then prove that MMD gradient flow can indeed compute stationary MMD points, based on a refined convergence analysis that establishes a novel non-asymptotic finite-particle error bound.
♻ ☆ Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
Large-scale Mixture of Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, achieving remarkable model capability similar to proprietary ones. But their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B) using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse serving systems. We verify these insights on both future wafer-scale GPU architectures and existing GPU systems. On wafer-scale GPUs, lightweight architectural modifications guided by our insights yield a 6.6$\times$ average speedup across four 200B--1000B models. On existing GPU systems, our insights drive the design of a prefill-aware expert placement algorithm that achieves up to 1.25$\times$ speedup on MoE computation. Our work presents the first comprehensive data-centric analysis of large-scale MoE models together with a concrete design study applying the learned lessons. Our profiling traces are publicly available at \href{https://huggingface.co/datasets/core12345/MoE_expert_selection_trace}{\textcolor{blue}{https://huggingface.co/datasets/core12345/MoE\_expert\_selection\_trace}}.
♻ ☆ Adversarial Causal Tuning for Realistic Time-series Generation
We address the problem of generating simulated, yet realistic, time-series data from a causal model with the same observational and interventional distributions as a given real dataset (probabilistic causal digital twin). While non-causal models (e.g., GANs) also strive to simulate realistic data, causal models are fundamentally more powerful, able to simulate the effect of interventions (what-if scenarios), optimize decisions, perform root-cause analysis, and counterfactual causal reasoning. We introduce the Adversarial Causal Tuning (ACT) methodology, which outputs the optimal causal model that fits the data, along with a quantification of the goodness-of-fit. The returned causal model can then be employed to simulate new data or to perform other causal reasoning tasks. ACT adopts ideas from Generative Adversarial Network training and AutoML to search for optimal causal pipelines and discriminators that detect deviations between the distributions of real and simulated data. It also adapts a permutation testing procedure from established causal tuning methods to penalize models for complexity. Through extensive experiments on real, semi-synthetic, and synthetic datasets, we show that (a) employing multiple optimized discriminators is paramount for selecting the optimal causal models and quantifying goodness-of-fit, (b) ACT selects the optimal causal model in synthetic datasets while avoiding overfitting, generating data indistinguishable from the true data distribution (c) all state-of-the-art generative and causal simulation methods, exhibit room for improvement in reproducing real data distributions; generating realistic temporal data is still an open research challenge.
comment: 22 pages, 3 figures
♻ ☆ MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology
Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to single-tissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduce MoLF (Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.
comment: Accepted at Proceedings 43rd International Conference on Machine Learning, Seoul, South Korea
♻ ☆ Cross-Model Consistency of Feature Importance in Electrospinning: Separating Robust from Model-Dependent Features
Electrospinning is a highly sensitive fabrication process in which small variations in operating parameters can significantly influence fiber morphology and material performance. Machine learning (ML) methods are increasingly employed to model these process-structure relationships and to identify the relative importance of processing variables. However, most existing studies rely on a single ML model, implicitly assuming that the resulting feature importance is robust and reproducible. In this study, the consistency of feature importance across multiple ML model families was systematically evaluated using a curated dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments. Twenty-one ML models representing linear, tree-based, kernel-based, neural network, and instance-based approaches were trained and compared. To provide a unified interpretability framework, SHAP (SHapley Additive exPlanations) values were used to calculate feature importance consistently across all models. A rank-based statistical analysis was then performed to quantify inter-model agreement and assess the robustness of parameter rankings. The results demonstrate that predictive performance and interpretive reliability are fundamentally distinct properties. Although several models achieved comparable predictive accuracy, substantial differences were observed in their feature importance rankings. Solution concentration emerged as the most robust and consistently influential parameter (variability = 0), whereas flow rate and applied voltage exhibited high ranking variability (variability > 0.9), indicating strong model dependence. These findings suggest that feature importance derived from a single ML model may be unreliable, particularly for small experimental datasets, and highlight the importance of cross-model validation for achieving trustworthy interpretation in ML-assisted electrospinning research.
♻ ☆ Is Data Shapley Not Better than Random in Data Selection? Ask NASH ICML-26
Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
♻ ☆ Make It Long, Keep It Fast: End-to-End 10k-Sequence Modeling at Billion Scale on Douyin Recommendation WWW 2026
Short-video recommenders such as Douyin must exploit extremely long user histories without breaking latency or cost budgets. We present an end-to-end system that scales long-sequence modeling to 10k-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy -- train on shorter windows, infer on much longer ones -- so the model generalizes to 10k histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end long-sequence recommendation to the 10k regime.
comment: WWW 2026
♻ ☆ Sequential Off-Policy Learning with Logarithmic Smoothing AISTATS 2026
Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, policies are updated and redeployed repeatedly, each time training on all previously collected data while generating new interactions for future updates. This sequential off-policy learning setting is common in practice but remains largely unexplored theoretically. In this work, we present and study a simple algorithm for sequential off-policy learning, combining Logarithmic Smoothing (LS) estimation with online PAC-Bayesian tools. We further show that a principled adjustment to LS improves performance and accelerates convergence under mild conditions. The algorithms introduced generalise previous work: they match state-of-the-art offline approaches in the batch case and substantially outperform them when policies are updated sequentially. Empirical evaluations highlight both the benefits of the sequential framework and the strength of the proposed algorithms.
comment: AISTATS 2026
♻ ☆ Neural ARFIMA model for forecasting BRIC exchange rates with long memory
Accurate forecasting of exchange rates remains a persistent challenge, particularly for emerging economies such as Brazil, Russia, India, and China (BRIC). These series exhibit long memory and nonlinearity that conventional time series models struggle to capture. Exchange rate dynamics are further influenced by several key drivers, including global economic policy uncertainty, US equity market volatility, US monetary policy uncertainty, oil price growth rates, and short-term interest rates. These empirical complexities underscore the need for a flexible framework that can jointly accommodate long memory, nonlinearity, and the influence of external drivers. We propose a Neural AutoRegressive Fractionally Integrated Moving Average (NARFIMA) model that combines the long memory structure of ARFIMA with the nonlinear learning capability of neural networks while incorporating exogenous variables. We establish asymptotic stationarity of NARFIMA and quantify forecast uncertainty using conformal prediction intervals. Empirical results show that NARFIMA consistently outperforms benchmark methods in forecasting BRIC exchange rates.
♻ ☆ PENEX: AdaBoost-Inspired Neural Network Regularization
AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes misclassified data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods, making it a practical objective for training neural networks. We demonstrate that PENEX effectively increases margins of data points, which can be translated into a generalization bound. Empirically, across computer vision and language tasks, PENEX improves neural network generalization in low-data regimes, matching and in some settings outperforming established regularizers at comparable computational cost. Our results highlight the potential of the exponential loss beyond its application in AdaBoost.
♻ ☆ SAND: The Challenge on Speech Analysis for Neurodegenerative Disease Assessment
Recent advances in Artificial Intelligence (AI) and the exploration of noninvasive, objective biomarkers, such as speech signals, have encouraged the development of algorithms to support the early diagnosis of neurodegenerative diseases, including Amyotrophic Lateral Sclerosis (ALS). Voice changes in subjects suffering from ALS typically manifest as progressive dysarthria, which is a prominent neurodegenerative symptom because it affects patients as the disease progresses. Since voice signals are complex data, the development and use of advanced AI techniques are fundamental to extracting distinctive patterns from them. Validating AI algorithms for ALS diagnosis and monitoring using voice signals is challenging, particularly due to the lack of annotated reference datasets. In this work, we present the outcome of a collaboration between a multidisciplinary team of clinicians and Machine Learning experts to create both a clinically annotated validation dataset and the "Speech Analysis for Neurodegenerative Diseases" (SAND) challenge based on it. Specifically, by analyzing voice disorders, the SAND challenge provides an opportunity to develop, test, and evaluate AI models for the automatic early identification and prediction of ALS disease progression.
♻ ☆ Follow the Mean: Reference-Guided Flow Matching
Existing approaches to controllable generation typically rely on fine-tuning, auxiliary networks, or test-time search. We show that flow matching admits a different control interface: adaptation through examples. For deterministic interpolants, the velocity field is solely governed by a conditional endpoint mean; shifting this mean shifts the flow itself. This yields a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows. We instantiate this idea in two forms. Reference-Mean Guidance is training-free: it computes a closed-form endpoint-mean correction from a reference bank and applies it to a frozen FLUX.2-klein (4B) model, enabling control of color, identity, style, and structure while keeping the prompt, seed, and weights fixed. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and learned residual refiner, matching unconditional DiT-B/4 quality on AFHQv2 while allowing the reference set to be swapped at inference time. These results point to a broader direction: generative models that adapt through data, not parameter updates.
♻ ☆ The Offline-Frontier Shift: Diagnosing Distributional Limits in Generative Multi-Objective Optimization
Offline multi-objective optimization (MOO) aims to recover Pareto-optimal designs given a finite, static dataset. Recent generative approaches, including diffusion models, show strong performance under hypervolume, yet their behavior under other established MOO metrics is less understood. We show that generative methods systematically underperform evolutionary alternatives with respect to other metrics, such as generational distance. We relate this failure mode to the offline-frontier shift, i.e., the displacement of the offline dataset from the Pareto front, which acts as a fundamental limitation in offline MOO. We argue that overcoming this limitation requires out-of-distribution sampling in objective space (via an integral probability metric) and empirically observe that generative methods remain conservatively close to the offline objective distribution. Our results position offline MOO as a distribution-shift--limited problem and provide a diagnostic lens for understanding when and why generative optimization methods fail.
♻ ☆ Localising Dropout Variance in Twin Networks
Accurate individual treatment-effect estimation demands not only reliable point predictions but also uncertainty measures that help practitioners \emph{locate} the source of model failure. We introduce a layer-wise variance decomposition for deep twin-network models: by toggling Monte Carlo Dropout independently in the shared encoder and the outcome heads, we split total predictive variance into an \emph{encoder component} ($σ_{\mathrm{enc}}^2$) and a \emph{head component} ($σ_{\mathrm{head}}^2$), with $σ_{\mathrm{enc}}^2 + σ_{\mathrm{head}}^2 \approx σ_{\mathrm{tot}}^2$ by the law of total variance. Across three synthetic covariate-shift regimes, the encoder component dominates under distributional shift ($ρ_{\mathrm{enc}}=0.53$) while the head component becomes informative only once encoder uncertainty is controlled. On a real-world twins cohort with induced multivariate shift, only $σ_{\mathrm{enc}}^2$ spikes on out-of-distribution samples and becomes the primary error predictor ($ρ_{\mathrm{enc}}\!\approx\!0.89$), while $σ_{\mathrm{head}}^2$ remains flat. The decomposition adds negligible cost over standard MC Dropout and provides a practical diagnostic for deciding whether to collect more diverse covariates or more outcome data.
comment: 14 pages, 5 figures, 3 tables
♻ ☆ SDG-MoE: Signed Debate Graph Mixture-of-Experts
Sparse MoE models achieve a good balance between capacity and compute by routing each token to a small subset of experts. However, in most MoE architectures, once a token is routed, the selected experts process it independently and their outputs are combined via a weighted sum. This leaves open whether enabling communication among them could improve performance. While prior work has raised this question, direct interaction among the active routed experts remains underexplored. In this paper, we propose SDG-MoE (Signed Debate Graph Mixture-of-Experts), a novel architecture that adds a lightweight, iterative deliberation step before final aggregation. SDG-MoE introduces three components: (i) two learned interaction matrices over the active experts, a support graph $A^+$ and a critique graph $A^-$, capturing reinforcing and corrective influences; (ii) a signed message-passing step that updates expert representations before aggregation; and (iii) a disagreement-gated Friedkin-Johnsen-style anchoring that controls deliberation strength while preventing expert drift. Together, these enable a structured deliberation process where interaction strength scales with disagreement and specialization is preserved. We also provide a theoretical analysis establishing stability conditions on expert states and showing that deliberation adds only low-order overhead over the active set. In controlled three-seed pretraining experiments, SDG-MoE improves validation perplexity over both an unsigned graph communication baseline and vanilla MoE, outperforming the strongest baseline by 19.8%, and gives the best external perplexity on WikiText-103, C4, and Paloma among the compared systems.
♻ ☆ Diffusion-State Policy Optimization for Masked Diffusion Language Models
Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo .
♻ ☆ Principled Latent Diffusion for Graphs via Laplacian Autoencoders
Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes -- and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion in that space. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps nodes to fixed-dimensional embeddings that enable near-lossless reconstruction of both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, thereby removing the quadratic adjacency-space bottleneck in the diffusion process and enabling the training of substantially larger generative backbones. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models while delivering up to a $1000\times$ speed-up. Our code is available at https://github.com/asiraudin/LG-Flow .
comment: Preprint, under review
♻ ☆ Hyperbolic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.
comment: 24 pages, 14 figures
♻ ☆ HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench ICML 2026
SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). We validate our approach on models with up to 560B parameters across different context windows (32K/128K). This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.
comment: Accepted at ICML 2026. 21 pages, 15 figures
♻ ☆ Targeted Synthetic Control Method
The synthetic control method (SCM) estimates causal effects in panel data with a single-treated unit by constructing a counterfactual outcome as a weighted combination of untreated control units that matches the pre-treatment trajectory. In this paper, we introduce the targeted synthetic control (TSC) method, a new two-stage estimator that directly estimates the counterfactual outcome. Specifically, our TSC method (1) yields a targeted debiasing estimator, in the sense that the targeted updating refines the initial weights to produce more stable weights; and (2) ensures that the final counterfactual estimation is a convex combination of observed control outcomes to enable direct interpretation of the synthetic control weights. TSC is flexible and can be instantiated with arbitrary machine learning models. Methodologically, TSC starts from an initial set of synthetic-control weights via a one-dimensional targeted update through the weight-tilting submodel, which calibrates the weights to reduce bias of weights estimation arising from pre-treatment fit. Furthermore, TSC avoids key shortcomings of existing methods (e.g., the augmented SCM), which can produce unbounded counterfactual estimates. Across extensive synthetic and real-world experiments, TSC consistently improves estimation accuracy over state-of-the-art SCM baselines.
♻ ☆ Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion AAAI 2026
Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
comment: Published at AAAI 2026
♻ ☆ Meta-Learning and Targeted Differential Privacy to Improve the Accuracy-Privacy Trade-off in Recommendations
Balancing differential privacy (DP) with recommendation accuracy is a key challenge in privacy-preserving recommender systems, since DP-noise degrades accuracy. We address this trade-off at both the data and model levels. At the data level, we apply DP only to the most stereotypical user data likely to reveal sensitive attributes, such as gender or age, to reduce unnecessary perturbation; we refer to this as targeted DP. At the model level, we use meta-learning to improve robustness to remaining DP-noise. This achieves a better trade-off between accuracy and privacy than standard approaches: Meta-learning improves accuracy and targeted DP leads to lower empirical privacy risk compared to uniformly applied DP and full DP baselines. Overall, our findings show that selectively applying DP at the data level together with meta-learning at the model level can effectively balance recommendation accuracy and user privacy.
comment: Accepted at LBR@UMAP'26
♻ ☆ Direct Bethe Free Energy Minimization for Bayesian Neural Network
We propose training Bayesian neural networks by directly minimizing the Bethe free energy rather than maximizing a variational lower bound. On tree-structured factor graphs the Bethe free energy is exact; deterministic layers drop out of the objective and are trained by standard backpropagation, so the framework accommodates any mixture of probabilistic and deterministic subgraphs without modification. Restricting the weight posterior to a last-layer Gaussian yields analytically tractable losses: for a Gaussian likelihood the Bethe loss equals the exact marginal likelihood, and for a probit likelihood it reduces to a closed form via the probit-Gaussian convolution. Both objectives sit strictly between MAP and the ELBO ($L_\text{MAP} \leq L_\text{Bethe} \leq L_\text{ELBO}$), removing the structural Jensen gap that no choice of variational family can close. The Z-consistent prior formulation makes the prior precision a differentiable parameter, enabling empirical Bayes - joint optimization of weights, covariance, and hyperparameters - in a single gradient pass, with no cross-validation or outer loop. All variants admit a closed-form predictive at MAP-equivalent inference cost, in contrast to ensemble and sampling-based methods. On 8 UCI regression and 12 UCI classification benchmarks evaluated under a single shared hyperparameter regime, Bethe is competitive with standard reference methods at single-pass cost. Independently, joint single-pass empirical Bayes matches grid-search cross-validation of the prior precision on essentially all dataset-variant combinations, eliminating the outer hyperparameter loop without measurable cost. Isolated optimization gaps on a few datasets reflect numerical rather than principled limitations of the framework.
comment: Submited to conference - fix typo in title + name
♻ ☆ Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding ICLR 2026
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our project page is available at https://daioba.github.io/surelock .
comment: Accepted to ICLR 2026
♻ ☆ Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
The widespread adoption of deep learning models in computer vision has intensified concerns about interpretability. Despite strong performance, these models are often treated as black boxes, with limited systematic investigation of their decision-making processes. While many interpretability methods exist, objective evaluation of learned representations remains limited, particularly for approaches that rely on sparsity to "induce" interpretability. In this work, we investigate how modeling choices in Concept Bottleneck Models (CBMs) affect the semantic alignment of concept representations. We introduce Clarity, a novel metric that captures the interplay between downstream performance and the sparsity and precision of concept activations. Using an interpretability assessment framework grounded in datasets with ground-truth concept annotations, we evaluate both VLM- and attribute predictor-based CBMs across three amortized sparsity-inducing strategies ($\ell_1$, $\ell_0$, and Bernoulli-based), alongside several widely used sparsity-aware CBM methods from the literature. Our experiments reveal a critical flexibility-interpretability trade-off: a model's capacity to optimize task performance by deviating from semantic alignment. We demonstrate that under this trade-off, different methods exhibit markedly different behaviors even at comparable performance levels. Finally, we validate our framework through a principled human study, which confirms that Clarity aligns significantly more closely with human trust than standard evaluation metrics.
♻ ☆ Offline Constrained Reinforcement Learning under Partial Data Coverage
We study offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. Prior methods either require full data coverage for evaluating intermediate policies, lack oracle efficiency, or requires the knowledge of data-generating distribution for policy extraction. We propose PDOCRL, an oracle-efficient primal-dual algorithm based on a decomposed linear-programming formulation that makes the policy an explicit optimization variable. This avoids policy extraction that requires the knowledge of data-generating distribution, and only uses standard policy-optimization, online linear-optimization, and linear-minimization oracles. We show that saddle-point formulations using general function approximation can have spurious saddle points even when an optimal solution is realizable, and identify a stronger realizability condition under which every restricted saddle point is optimal. Under this condition and partial coverage of an optimal policy, PDOCRL returns a near-optimal, near-feasible policy with a \(\widetilde{\mathcal O}(ε^{-2})\) sample guarantee, without access to the data-generating distribution. Empirically, PDOCRL is competitive with strong baselines on standard offline constrained RL benchmarks.
♻ ☆ Off-Policy Learning with Limited Supply WWW 2026
We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
comment: Published as a conference paper at WWW 2026
♻ ☆ AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks
Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
♻ ☆ Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
comment: 17 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA
♻ ☆ Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks
Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.
♻ ☆ Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning ICML 2026
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions since these ``null bases" of old tasks can remain nearly inactive for new task under correlated tasks. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods. Our code is available at https://github.com/HHHLF/LoDA_ICML2026.
comment: Accepted by ICML 2026
♻ ☆ Calibrated Multimodal Representation Learning with Missing Modalities ICML 2026
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.
comment: Accepted by ICML 2026
♻ ☆ Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning
Event-triggered control provides a mechanism for avoiding excessive use of constrained communication bandwidth in networked multi-agent systems. However, most existing methods rely on accurate system models, which may be unavailable in practice. In this work, we propose a model-free, priority-driven reinforcement learning algorithm that learns communication priorities and control policies jointly from data in decentralized multi-agent systems. By learning communication priorities, we circumvent the hybrid action space typical in event-triggered control with binary communication decisions. We evaluate our algorithm on benchmark tasks and demonstrate that it outperforms the baseline method.
comment: Accepted to the 23rd IFAC World Congress
♻ ☆ DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows IEEE
We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io
comment: Published in IEEE TPAMI (vol. 48, no. 5, May 2026). This is an extended version of the ICCV 2023 paper (DiFaReli)
♻ ☆ Refresh-Scaling the Memory of Balanced Adam
Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice $β=0.944$, the refresh rule improves worst-case robustness, reducing the maximum relative gap in validation loss by 33.4\%, while bringing all 11 runs within 1\% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is more naturally viewed as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
♻ ☆ From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity ICML 2026
Traditional hallucination detection fails on "Stubborn Hallucinations" - errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
comment: Accepted to ICML 2026. Camera-ready version
♻ ☆ The Confusion is Real: GRAPHIC -- A Network Science Approach to Confusion Matrices in Deep Learning
Explainable artificial intelligence has emerged as a promising field of research to address reliability concerns in artificial intelligence. Despite significant progress in explainable artificial intelligence, few methods provide a systematic way to visualize and understand how classes are confused and how their relationships evolve as training progresses. In this work, we present GRAPHIC, an architecture-agnostic approach that analyzes neural networks on a class level. It leverages confusion matrices derived from intermediate layers using linear classifiers. We interpret these as adjacency matrices of directed graphs, allowing tools from network science to visualize and quantify learning dynamics across training epochs and intermediate layers. GRAPHIC provides insights into linear class separability, dataset issues, and architectural behavior, revealing, for example, similarities between flatfish and man and labeling ambiguities validated in a human study. In summary, by uncovering real confusions, GRAPHIC offers new perspectives on how neural networks learn. The code is available at https://github.com/Johanna-S-Froehlich/GRAPHIC.
comment: Transactions on Machine Learning Research, 2026
♻ ☆ Training Transformers for KV Cache Compressibility
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
comment: 32 pages, 4 figures
♻ ☆ Sparse Offline Reinforcement Learning with Corruption Robustness
We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse Markov decision process, and our goal is to estimate a near optimal policy. The main challenge is that, in the high-dimensional regime where the number of samples $N$ is smaller than the feature dimension $d$, exploiting sparsity is essential for obtaining non-vacuous guarantees but has not been systematically studied in offline RL. We analyse the problem under uniform coverage and sparse single-concentrability assumptions. While Least Square Value Iteration (LSVI), a standard approach for robust offline RL, performs well under uniform coverage, we show that integrating sparsity into LSVI is unnatural, and its analysis may break down due to overly pessimistic bonuses. To overcome this, we propose actor-critic methods with sparse robust estimator oracles, which avoid the use of pointwise pessimistic bonuses and provide the first non-vacuous guarantees for sparse offline RL under single-policy concentrability coverage. Moreover, we extend our results to the contaminated setting and show that our algorithm remains robust under strong contamination. Our results provide the first non-vacuous guarantees in high-dimensional sparse MDPs with single-policy concentrability coverage and corruption, showing that learning a near-optimal policy remains possible in regimes where traditional robust offline RL techniques may fail.
♻ ☆ MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making
Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.
♻ ☆ Molecular Design beyond Training Data with Novel Extended Objective Functionals of Generative AI Models Driven by Quantum Annealing Computer
Deep generative modeling to stochastically design small molecules is an emerging technology for accelerating drug discovery and development. However, one major issue in molecular generative models is their lower frequency of drug-like compounds. To resolve this problem, we developed a novel framework for optimization of deep generative models integrated with a D-Wave quantum annealing computer, where our Neural Hash Function (NHF) presented herein is used both as the regularization and binarization schemes simultaneously, of which the latter is for transformation between continuous and discrete signals of the classical and quantum neural networks, respectively, in the error evaluation (i.e., objective) function. The compounds generated via the quantum-annealing generative models exhibited higher quality in both validity and drug-likeness than those generated via the fully-classical models, and was further indicated to exceed even the training data in terms of drug-likeness features, without any restraints and conditions to deliberately induce such an optimization. These results indicated an advantage of quantum annealing to aim at a stochastic generator integrated with our novel neural network architectures, for the extended performance of feature space sampling and extraction of characteristic features in drug design.
comment: 28 pages, 4 figures
♻ ☆ P-Flow: Proxy-gradient Flows for Linear Inverse Problems
Generative models based on flow matching have emerged as a powerful paradigm for inverse problems, offering straighter trajectories and faster sampling compared to diffusion models. However, existing approaches often necessitate differentiating through unrolled paths, leading to numerical instability and prohibitive computational overhead. To address this, we propose P-Flow, a framework that stabilizes the reconstruction process by leveraging a proxy gradient to update the source point. This approach effectively circumvents the numerical instability and memory overhead of long-chain differentiation. To ensure consistency with the prior distribution, we employ a Gaussian spherical projection motivated by the concentration of measure phenomenon in high-dimensional spaces. We further provide a theoretical analysis for P-Flow based on Bayesian theory and Lipschitz continuity. Experiments across diverse restoration tasks demonstrate that P-Flow delivers competitive performance, especially under extreme degradations such as severely ill-posed conditions and high measurement noise.
♻ ☆ FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning
Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
comment: The authors want to withdraw this manuscript for further verification and revision. We may release a substantially revised version in the future
Multimedia 7
☆ Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-training behaves under a visually debiased evaluation setting. We audit nine omni-modal benchmarks with visual-only probing, remove visually solvable queries, and retain full subsets when filtering is undefined or would make comparisons unstable. This yields OmniClean, a cleaned evaluation view with 8,551 retained queries from 16,968 audited queries. On OmniClean, we evaluate OmniBoost, a three-stage post-training recipe based on Qwen2.5-Omni-3B: mixed bi-modal SFT, mixed-modality RLVR, and SFT on self-distilled data. Balanced bi-modal SFT gives limited and uneven gains, RLVR provides the first broad improvement, and self-distillation reshapes the benchmark profile. After SFT on self-distilled data, the 3B model reaches performance comparable to, and in aggregate slightly above, Qwen3-Omni-30B-A3B-Instruct without using a stronger omni-modal teacher. These results show that omni-modal progress is easier to interpret when evaluation controls visual leakage, and that small omni-modal models can benefit from staged post-training with self-distilled omni-query supervision.
☆ Very Efficient Listwise Multimodal Reranking for Long Documents ICML 2026
Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
comment: To appear in ICML 2026
☆ AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.
☆ UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
Unified multimodal models (UMMs) aim to integrate understanding and generation within a single architecture. However, it remains underexplored how to effectively coordinate these two capabilities for more effective and efficient reasoning. Existing coordination approaches either perform coupling during training, without explicit inference-time coordination, or impose a fixed coordination pattern for all inputs. In this work, we show that multimodal tasks exhibit substantial coordination-path diversity: different inputs favor different coordination paths. This suggests that exploiting such diversity is key to improving performance. We propose UniPath, a framework for adaptively modeling and exploiting coordination-path diversity. Instead of enforcing a single coordination pattern, we represent task solving as the selection and execution of a path, ranging from direct answering to textual inference, visual-thought construction, and hypothesis-based exploration. We construct role-aligned trajectories to train a path-conditioned executor and introduce a lightweight planner mechanism to enable input-dependent path selection. Experiments show that leveraging coordination-path diversity improves performance over fixed coordination strategies while providing interpretable intermediate behaviors. The code is available at:https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/unipath.
☆ Synthesizing the Expert: A Validated Multimodal Dataset for Trustworthy AI-Assisted Swimming Coaching
This research is primarily concerned with the critical problem of synthesizing a structured Retrieval-Augmented Generation (RAG) system for advanced AI applications in the domain of swimming. As the integration of Artificial Intelligence in sports science matures, its applications in swimming have become increasingly diverse, spanning from real-time technical coaching and talent scouting to comprehensive performance profiling and the dynamic personalization of training periodization. Within this landscape, RAG-based systems represent a pivotal advancement in Large Language Model (LLM) enhanced swimming analysis, as they allow for the grounding of generative outputs in authoritative domain knowledge, thereby ensuring the credibility of AI-generated advice, contextually and technically. Despite this potential, building robust RAG systems using only real-world aquatic data presents significant challenges, including ethical constraints regarding athlete biometrics, and the high cost of manual expert labeling. To address these barriers, we propose a novel generative framework that leverages a multimodal knowledge base gathered across four dimensions: physiological data, physiological literature, kinematic sensor data, and unstructured domain expertise. Our proposed framework utilizes a multi-agent LLM architecture to synthesize a high-fidelity dataset of 1,864 validated "Question-Context-Answer" triplets-drawn from 1,914 drafts evaluated against 12 physiological soundness rules. By providing a structured, synthetic ground truth, this work establishes a foundational benchmark for trustworthy AI in aquatics. The outcomes of this research promise to enhance the reliability of automated coaching and open a plethora of future directions in "Meta-Agent" development and athletic profiling, ultimately bridging the gap between raw data engineering and practical sports science application.
♻ ☆ Calibrated Multimodal Representation Learning with Missing Modalities ICML 2026
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.
comment: Accepted by ICML 2026
♻ ☆ RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce \textbf{RW-Post}, a post-aligned \textbf{text--image benchmark} for real-world multimodal fact-checking with \emph{auditable} annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide \textbf{AgentFact} as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at https://github.com/xudanni0927/AgentFact.
comment: This submission was made in error. It was intended to replace the existing submission arXiv:2512.22933 rather than create a new submission
Computer Vision and Pattern Recognition 265
☆ Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
☆ Personal Visual Context Learning in Large Multimodal Models
As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.
comment: Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/
☆ Variational Inference for Lévy Process-Driven SDEs via Neural Tilting
Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.
comment: The associated project page which contains the official implementation can be found in https://circle-group.github.io/research/NeuralTilting/
☆ Pixal3D: Pixel-Aligned 3D Generation from Images SIGGRAPH 2026
Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/
comment: SIGGRAPH 2026. Project page: https://ldyang694.github.io/projects/pixal3d/
☆ Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2\% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
☆ CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters' difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
☆ Counterfactual Stress Testing for Image Classification Models
Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.
☆ Count Anything at Any Granularity
Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.
comment: Project page: https://verg-avesta.github.io/KubriCount/
☆ Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation
Cross-domain few-shot medical image segmentation (CD-FSMIS) requires a model to generalise simultaneously to novel anatomical categories and unseen imaging domains from only a handful of annotated examples. Existing prototypical approaches inevitably entangle anatomical structure with domain-specific appearance variations, and thus lack a stable reference for reliable matching under domain shift. We observe that the geometric structure of human anatomy constitutes a reliable, domain-transferable prior that has been overlooked. Building on this insight, we propose GeoProto, a geometry-aware CD-FSMIS framework that enriches prototypical matching with explicit structural priors. The core component, Geometry-Aware Prototype Enrichment (GAPE), augments each local appearance prototype with a learned geometric offset encoding its ordinal position within the organ's interior topology. This offset is derived from an auxiliary Ordinal Shape Branch (OSB) trained under an ordinally consistent objective that enforces monotonic variation of geometric embeddings across interior strata, requiring no annotation beyond standard segmentation masks. Extensive experiments across seven datasets spanning three evaluation settings (cross-modality, cross-sequence, and cross-context) demonstrate that GeoProto achieves state-of-the-art performance.
☆ CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation
Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.
☆ BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON ( Behavioral Engine for Authentication \& Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive \textit{Valorant} gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary \textit{Valorant} configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models
☆ BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
comment: 9 page 7 figures
☆ Masked Generative Transformer Is What You Need for Image Editing CVPR 2026
Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
comment: CVPR 2026 HiGen Workshop; Project Page at https://weichow23.github.io/EditMGT/ GitHub at https://github.com/weichow23/EditMGT
☆ Is Your Driving World Model an All-Around Player? CVPR 2026
Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.
comment: CVPR 2026 VideoWorldModel Workshop; Project Page at https://worldbench.github.io/worldlens GitHub at https://github.com/worldbench/WorldLens
☆ Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
comment: 31 pages, 12 figures
☆ BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation ACL 2026
As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.
comment: ACL 2026 System Demonstration paper. 2 figures
☆ Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
comment: 13 pages, 7 figures
☆ MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
☆ Predicting 3D structure by latent posterior sampling
The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
☆ ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
☆ PhyGround: Benchmarking Physical Reasoning in Generative World Models
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
comment: Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/
☆ Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction IEEE
Accurate quantification of forest coverage and combustible biomass (fuel load) is critical for wildfire risk assessment and ecosystem management. However, traditional methods relying on airborne LiDAR or field surveys are cost-prohibitive and time-intensive, while satellite imagery often lacks the vertical resolution required for canopy volume analysis. This paper proposes a novel, automated pipeline for rapid forest inventory using virtual remote sensing data derived from Google Earth Studio (GES). Our approach first generates low-altitude orbital imagery and camera poses for a target region. For dense 3D reconstruction, we employ Pi-Long, developed within the VGGT-Long framework. This model serves as a scalable extension of the Pi-3 feed-forward Transformer architecture. To address the inherent scale ambiguity in monocular reconstruction, we introduce a metric recovery module that aligns the reconstructed trajectory with GES ground truth poses via Sim(3) Umeyama optimization. The metric-scale point cloud is then orthogonally projected into Bird's-Eye-View (BEV) height and density maps. Finally, we employ a watershed-based segmentation algorithm combined with height variance analysis to classify tree species (conifer vs. broadleaf), calculate Leaf Area Index (LAI), and estimate total fuel load. Experimental results demonstrate that this pipeline offers a scalable, cost-effective alternative to physical scanning, enabling near-real-time estimation of forest biomass with high geometric consistency.
comment: Accepted for publication at IEEE IGARSS 2026
☆ Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenizatio
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2{=}0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
☆ Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition SP
Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.
comment: Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV
☆ MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation CVPR 2026
The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion.In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS.We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers.We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
comment: Accepted to CVPR 2026 Findings. 11 pages, 6 figures
☆ Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
☆ Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
comment: Preprint. 17 pages, 8 figures, 6 tables
☆ GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; common for training-free selectors via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
☆ RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with >1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.
☆ Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.
☆ TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection
Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a \textbf{T}est-time \textbf{I}D-prototype-separated \textbf{N}egative \textbf{S}emantics learning method, termed \textbf{TINS}. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04\% to 6.72\%. Our code is available at https://github.com/zxk1212/tins.
☆ C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.
☆ Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model SP
We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
comment: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI
☆ iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning
Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance system. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45\% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.
☆ Qwen-Image-2.0 Technical Report
We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.
☆ AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
☆ Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling IEEE
Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)
☆ UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting
Landslide monitoring and simulation play an important role in urban safety assessment and disaster prevention. Existing landslide simulation pipelines typically rely on digital elevation model and mesh-based representations, which are suitable for geometric analysis, but often lack visual realism. This limitation reduces their effectiveness in interactive applications, hazard communication, and public education. In this paper, we propose a UAV-based scan-to-simulation framework that bridges photorealistic scene capture and physics-based landslide simulation through 3DGS. Specifically, our pipeline includes four stages: (1) UAV-based acquisition of slope imagery, (2) reconstruction of a low-anisotropy 3DGS scene representation, (3) volumetric conversion of the target simulation region by filling the interior of the surface-based model, and (4) integration with the Material Point Method (MPM) for landslide simulation. We validate the proposed framework on a real landslide site in Hong Kong that experienced a severe landslide event. The results show that our method supports both realistic visual reconstruction and effective simulation.
☆ TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering
Transmissive scenes are ubiquitous in daily life, yet reconstructing and rendering them remains highly challenging due to the inherent entanglement between near-field reflections from the surrounding environment on the transmissive surface, and the transmitted content of the scene behind it. This coupling gives rise to dual surface geometries and dual radiance components within each observation, posing ambiguities for standard methods. We present TransmissiveGS, a novel framework for disentangled reconstruction and rendering of transmissive scenes. Specifically, we model the scene with a dual-Gaussian representation and introduce a deferred shading function to jointly render the two Gaussian components. To separate reflection and transmission, we exploit the inherent multi-view inconsistency of reflections and leverage the residuals from reconstructing multi-view consistent content as cues for disentangled geometry and appearance modeling. We further propose a reflection light field that enables high-fidelity estimation of near-field reflections. During training, we introduce a high-frequency regularization to preserve fine details. We also contribute a new synthetic dataset for evaluating transmissive surface reconstruction. Experiments on both synthetic and real-world scenes demonstrate that TransmissiveGS consistently outperforms prior Gaussian Splatting-based methods in both reconstruction and rendering quality for transmissive scenes.
☆ Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
☆ Neuromorphic Monocular Depth Estimation with Uncertainty Modeling
Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error. Quantitative results show that the representations perform similarly, while 10 bin log-normal and 5 bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation, and be used to indicate pixels with reliable depth.
☆ bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
comment: 31 pages, 16 figures
☆ GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks
Data-driven medical AI is traditionally formulated as a discriminative mapping from input $X$ to output $Y$ via a learned function $f$, which does not generalize well across heterogeneous data and modalities encountered in real-world clinical settings. In this work, we propose a fundamentally different, generative paradigm. We model the joint distribution $P(X,Y)$ using diffusion models and reframe inference as a test-time output optimization problem. By guiding the generative process to match observed inputs, our framework enables flexible, gradient-based conditioning at inference time without architectural changes or retraining, effectively supporting arbitrary and previously unseen combinations of observations. Extensive experiments demonstrate strong performance across standard and cross-modality medical image segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot application to demonstrate generality. To support these evaluations, we curated and released a large-scale text-shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.
☆ LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation in the generalization performance of the Student. We apply the proposed framework on models build upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
comment: Under review
☆ Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction
Recently, diffusion models have attracted considerable attention for magnetic resonance image reconstruction due to their high sample quality. However, most existing methods rely on large networks with opaque time-conditioning mechanisms, and require offline coil sensitivity estimation. This results in limited interpretability of the reconstruction process and reduced flexibility in the acquisition setup. To address these limitations, we jointly reconstruct the image and the coil sensitivities by combining the parameter-efficient product-of-Gaussian-mixture diffusion model as an image prior with a classical smoothness prior on the coil sensitivities. The proposed method is fast and robust to both contrast and anatomical distribution shifts as well as changing k-space trajectories. Finally, we propose a more expressive parameterization of the image prior which improves results in denoising and magnetic resonance image reconstruction.
☆ Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection
Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with \textbf{Sparse Hyper Matching}: \textit{sparsemax} first selects the most relevant support patches, which are then aggregated into a \textit{hyperedge} as compact normal evidence to suppress background noise and distractors. We further introduce \textbf{Dual-Branch Image Scoring}, which fuses \emph{spatial anomaly evidence} from the patch-grid anomaly map with \emph{global semantic deviation} captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
☆ Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination ACL 2026
Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations-generating text that contradicts visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.
comment: Accepted by ACL 2026 Main
☆ MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
☆ Segment Anything with Robust Uncertainty-Accuracy Correlation ICML 2026
Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://github.com/HongyouZhou/ruac.git.
comment: ICML 2026
☆ Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence NeurIPS 2026
Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.
comment: Submitted to NeurIPS 2026
☆ CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations ICMR2026
Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene's dynamic evolution, solely from visual observations.
comment: ICMR2026 Accepted
☆ FrequencyCT: Frequency domain pseudo-label generation for self-supervised low-dose CT denoising
Despite extensive research on computed tomography (CT) denoising, few studies exploit projection-domain data characteristics to mitigate noise correlation. To address this, this work proposes FrequencyCT, the first zero-shot self-supervised method for pseudo-label generation in the frequency domain for low-dose CT denoising. Leveraging the characteristic of the frequency domain that largely isolates noise from clean signals, a regional low-frequency anchoring technique is proposed. Phase-preserving amplitude modulation and mask perturbation in the high-frequency region generate pseudo-label data for self-supervision. The fluctuating noise variance in the projection domain prompts truncation of the generated samples to stabilize the network's optimization gradient. Evaluation results on multiple public and real-world datasets confirm the clinical application potential of this research, which will have a revolutionary impact on the field of denoising. The code can be obtained from https://github.com/yqx7150/FrequencyCT.
☆ Polygon-mamba: Retinal vessel segmentation using polygon scanning mamba and space-frequency collaborative attention
Retinal vessel segmentation is crucial for diagnosis and assessment of ocular diseases. Notably, segmentation of small retinal vessels has been consistently recognized as a challenging and complex task. To tackle this challenge, we design a hybrid CNN-Mamba fusion network that integrates polygon scanning mamba and space-frequency collaborative attention mechanism for the detection of small vessels. Considering that the traditional mamba architecture with horizontal-vertical scanning may compromise the topological integrity of target structures and result in local discontinuities in small retinal vessels, we present a polygon scanning visual state space model (PS-VSS) to identify small vessel structural features by multi-layer reverse scanning way. Which effectively preserves pixels connectivity, thereby substantially mitigating the loss of information pertaining to small vessels. Furthermore, as we all known that the spatial domain prioritizes positional and structural information, while the frequency domain emphasizes global perception and local detail components, a space-frequency collaborative attention mechanism (SFCAM) is introduced within the skip connection to extract efficient features from the spatial and frequency domains. This strategy empowers the model to dynamically enhance the key features while effectively suppressing clutters. To assess the efficacy of our model, it was tested on three publicly available datasets: DRIVE, STARE, and CHASE_DB1. Compared to manual annotations, our model demonstrated F1 scores of 0.8283, 0.8282, and 0.8251, Area Under Curve (AUC) values of 0.9806, 0.9840, and 0.9866, and Sensitivity (SE) values of of 0.8268, 0.8314, and 0.8484 across three datasets, respectively. The effectiveness of our model was validated through both visual inspection and quantitative analysis.
☆ SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose \textbf{SenseBench}, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual \textit{perception} and subjective diagnostic \textit{description}. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also \textit{fluency illusion} and a \textit{perception-description inversion} effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available \href{https://github.com/Zhong-Chenchen/SenseBench}{\textcolor{blue}{here}}.
☆ Set-Based Groupwise Registration for Variable-Length, Variable-Contrast Cardiac MRI MICCAI 2026
Quantitative cardiac magnetic resonance imaging (MRI) enables non-invasive myocardial tissue characterization but relies on robust motion correction within these variable-length, variable-contrast image sequences. Groupwise registration, which simultaneously aligns all images, has shown greater robustness than pairwise registration for motion correction. However, current deep-learning-based groupwise registration methods cannot generalize across MRI sequences: the architecture typically encodes input data as a fixed-length channel stack, which rigidly couples network design to protocol-specific sequence length, input ordering, and contrast dynamics. At inference time, any change in imaging protocols will render the network unusable. In this work, we introduce \emph{\AnyTwoReg}, a new set-based groupwise registration framework that takes a quantitative MRI sequence as an unordered set. This set formulation fundamentally decouples network design from sequence length and input ordering. By utilizing a shared encoder and correlation-guided feature aggregation, \emph{\AnyTwoReg} constructs a permutation-invariant canonical reference for registration, and learns a permutation-equivariant mapping from images to deformation fields. Additionally, we extract contrast-insensitive image features from an existing foundation model to handle extreme contrast variations. Trained exclusively on a single public $T_1$ mapping dataset (STONE, sequence length $L=11$), \AnyTwoReg generalizes to two unseen quantitative MRI datasets (MOLLI, ASL) with variable lengths ($L \in [11, 60]$) and different contrast dynamics. It achieves strong cross-protocol generalization in a zero-shot manner, and consistently improves downstream quantitative mapping quality. Notably, while designed for quantitative MRI sequences, our framework is directly applicable to Cine MRI sequences for inter-cardiac-phase registration.
comment: MICCAI 2026. Submitted Version
☆ VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos ICME2026
In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method outperforms achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.
comment: ICME2026 Accepted
☆ DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving ICML 2026
End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.
comment: ICML 2026
☆ EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimising inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
comment: 10 pages
☆ TIE: Time Interval Encoding for Video Generation over Events
Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events -- a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
☆ GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency-particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig.2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth
☆ Improving Human Image Animation via Semantic Representation Alignment CVPR 2026
The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
comment: Accepted by CVPR 2026 workshop
☆ DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem \textbf{intra-group hidden failure}. To solve this, we propose \textbf{DuetFair} mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce \textbf{FairDRO}, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points ($\uparrow 6.0\%$) under the tumor-stage grouping and by 4.1 points ($\uparrow 7.4\%$) under the institution grouping over the strongest baseline.
comment: 16 pages, 2 figures
☆ Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data
Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
☆ M$^2$E-UAV: A Benchmark and Analysis for Onboard Motion-on-Motion Event-Based Tiny UAV Detection
Tiny UAV detection from an onboard event camera is difficult when the observer and target move at the same time. In this motion-on-motion regime, ego-motion activates background edges across buildings, vegetation, and horizon structures, while the UAV may appear as a sparse event cluster. To explore this practical problem, we present M$^2$E-UAV, a benchmark and analysis setup for onboard motion-on-motion event-based tiny UAV detection. The processed M$^2$E-UAV benchmark contains 87,223 training samples and 21,395 validation samples across four scene families: sunny building-forest, sunny farm-village, sunset building-forest, and sunset farm-village. We provide M$^2$E-Point, a point-based event baseline, and M$^2$E-Point + IMU, an IMU-conditioned variant, to analyze the role of inertial cues under onboard motion-on-motion detection. M$^2$E-Point encodes events as $[x,y,t,p]$ point sets, extracts local event structure with EdgeConv, and predicts event-level UAV foreground scores, from which bounding boxes are derived via DBSCAN. Our validation-stage analysis shows that point-based event modeling is a strong baseline, while simple IMU conditioning provides only marginal aggregate gains. Under the train/validation split, M$^2$E-Point achieves 0.9673 F1 and 0.5501 mAP50-95, while the IMU-conditioned variant reaches 0.5561 mAP50-95 with only marginal aggregate changes, serving as an initial baseline for future exploration in this domain. Code will be ready in https://github.com/Wickyan/M2E-UAV.
☆ OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.
comment: 13 figures
☆ Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution
Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR) that employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that our M$^3$ESR significantly boosts generalization and semantic consistency performances, which confirms our superiority.
☆ Automated Detection of Abnormalities in Zebrafish Development
Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to compounds (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results present the model's effectiveness, achieving 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.
☆ Automated high-frequency quantification of fish communities and biomass using computer vision
Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.
comment: 21 pages, 3 figures, supplementary information under Ancillary files
☆ Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.
☆ Filtering Memorization from Parameter-Space in Diffusion Models
Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose \textbf{Base-Anchored Filtering (BAF)}, a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.
☆ Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure
Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.
☆ WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.
comment: Project Page: https://unix-ai-lab.github.io/WorldReasonBench/
☆ CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
☆ Progressive Photorealistic Simplification
Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
☆ Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable
With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.
comment: 19 pages, 7 figures
☆ AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improve anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
comment: We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw
☆ Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection IEEE
The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the (in-)existence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.
comment: Authors' Accepted Version; Accepted at IEEE ICIP 2026
☆ Phoenix-VL 1.5 Medium Technical Report
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion tokens multimodal corpus, followed by a 250-billion tokens long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22-billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles, training methodology, and highlight benchmark and inference performance.
comment: Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1
☆ Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
End to end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training set design variable. Starting from high frequency E2E driving datasets, we construct frequency sweep training sets by temporally subsampling camera frames along each trajectory. For each model dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity aware perspective. Sparse sampling may miss driving relevant cues, while dense sampling may add redundant visual content and off manifold noise. For finite capacity models, this can create a driving irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model and dataset dependent frequency responses. Smaller E2E models often show non monotonic or near plateau trends and achieve their best 3 second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3 second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
☆ SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increase. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
☆ Halo Separation-guided Underwater Multi-scale Image Restoration
Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the foreground of the camera and seriously interfere with the quality of the image. The existing underwater image enhancement methods fail to fully consider this key problem, and the robustness of processing images under artificial light scenes is poor. In practical applications, since underwater image enhancement itself is a very challenging task, the influence of artificial light sources will lead to serious degradation of image performance and affect subsequent vision tasks. In order to effectively deal with this problem, this paper designs a single halo image correction method based on an iterative structure. The network is mainly divided into two sub-networks, one is the halo layer separation sub-network which aims to separate the halo by gradient minimization, and the other is the multi-scale recovery sub-network which aims to recover the image information masked by halo. The UIEB and EUVP synthetic datasets are used for training to ensure that the network can fully learn the characteristics and laws of underwater halo images. Then a large number of halo images taken in an underwater environment with real artificial light are collected for testing. In addition, the brightness distribution characteristics of underwater halo images are analyzed and the radial gradient is introduced to constraint eliminate halo to improve the effect of underwater image restoration.
☆ CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers
Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users -- from pathologists with no ML background to ML practitioners running many parallel experiments -- train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.
☆ DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions
While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($canonical \rightarrow dynamic$) and the backward deformation required for volumetric SDF rendering ($dynamic \rightarrow canonical$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.
☆ Portable Active Learning for Object Detection CVPR 2026
Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
comment: CVPR 2026(highlight)
☆ BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization
Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.
☆ EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
comment: 33 pages, 9 figures
☆ The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection
Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
☆ LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.
☆ PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction
Dynamic scene reconstruction represents a fundamental yet demanding challenge in computer vision and robotics. While recent progress in 3DGS-based methods has advanced dynamic scene modeling, obtaining high-fidelity rendering and accurate tracking in scenarios with substantial, intricate motions remains significantly challenging. To address these challenges, we propose PaMoSplat, a novel dynamic Gaussian splatting framework incorporating part awareness and motion priors. Our approach is grounded in two key observations: 1) Parts serve as primitives for scene deformation, and 2) Motion cues from optical flow can effectively guide part motion. Specifically, PaMoSplat initializes by lifting multi-view segmentation masks into 3D space via graph clustering, establishing coherent Gaussian parts. For subsequent timestamps, we leverage a differential evolutionary algorithm to estimate the rigid motion of these parts using multi-view optical flow cues, providing a robust warm-start for further optimization. Additionally, PaMoSplat introduces an adaptive iteration count mechanism, internal learnable rigidity, and flow-supervised rendering loss to accelerate and optimize the training process. Comprehensive evaluations across diverse scenes, including real-world environments, demonstrate that PaMoSplat delivers superior rendering quality, improved tracking precision, and faster convergence compared to existing methods. Furthermore, it enables multiple part-level downstream applications, such as 4D scene editing.
comment: Accepted by TCSVT. Project Url: https://pamosplat.github.io
☆ PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction
Polarimetric imaging captures surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division of-Focal-Plane (DoFP) color polarization imaging, recovering polarization parameters from captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras also face hardware bottlenecks and often cannot support high-frame-rate acquisition, limiting polarimetric imaging in dynamic video tasks. These limitations motivate joint spatial and temporal enhancement. To this end, we propose the first space-time polarization video reconstruction architecture. The method jointly models polarization directions in space and time and uses a polarization-aware implicit neural representation for continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. We also establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on this benchmark demonstrate the effectiveness of the method.
☆ Increasing the Efficiency of DETR for Maritime High-Resolution Images IEEE
Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid Network with successive downsampling and SSM layers, as well as token pruning to reduce unnecessary computation on background regions. Compared to state-of-the-art methods like RT-DETR with ResNet50 backbone, our approach achieves a better balance between performance and computational efficiency in maritime object detection.
comment: Accepted to IEEE ITSC 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. DOI to be added upon publication
☆ Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation
We present GraphDepth, a monocular depth estimation architecture that synergistically integrates Graph Neural Networks (GNNs) within a convolutional encoder-decoder framework. Our approach embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone, enabling explicit modeling of long-range spatial relationships that lie beyond the receptive field of local convolutions. Key technical contributions include: (1) batch-parallelized graph construction with configurable k-NN and grid-based adjacency for scalable training; (2) multi-scale GraphSAGE integration at bottleneck and decoder stages (1/32, 1/16, 1/8 resolution) to propagate global context throughout the feature hierarchy; (3) channel-attention gated skip connections that adaptively weight encoder features before fusion; and (4) heteroscedastic uncertainty estimation via a dedicated aleatoric uncertainty head, enabling confidence-aware loss weighting during optimization. Unlike transformer-based hybrids, which suffer from quadratic complexity in sequence length, GraphDepth scales linearly with spatial resolution while achieving comparable global receptive fields through iterative message passing. Experiments on NYU Depth V2, WHU Aerial, ETH3D, and Mid-Air benchmarks demonstrate competitive accuracy within 4.6\% of state-of-the-art transformers on indoor scenes with substantially lower computational cost (25 FPS vs 9 FPS, 3.8 GB vs 8.8 GB VRAM). GraphDepth achieves the best reported result on WHU Aerial (RMSE 8.24 m) and exhibits superior zero-shot cross-domain transfer to the Mid-Air synthetic aerial dataset, validating the generalization power of explicit relational reasoning for depth estimation.
☆ AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting
This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.
☆ VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection ICML 2026
Privacy protection has become a critical requirement in the era of ubiquitous visual data sharing, imposing higher demands on efficient and robust privacy detection algorithms. However, current robust detection models are severely hindered by the lack of comprehensive datasets. Existing privacy-oriented datasets often suffer from limited scale, coarse-grained annotations, and narrow domain coverage, failing to capture the intricate details of sensitive information in realworld environments. To bridge this gap, we present a large-scale, fine-grained Visual Privacy Dataset (VPD-100K), designed to facilitate generalized privacy detection. We establish a holistic taxonomy comprising four primary domains: Human Presence, On-Screen Personally Identifiable Information (PII), Physical Identifiers, and Location Indicators, containing 100,000 images annotated with 33 fine-grained classes and over 190,000 object instances. Statistical analysis reveals that our dataset features long-tailed distributions, small object scales, and high visual complexity. These characteristics make the dataset particularly valuable for demanding, unconstrained applications such as live streaming, where actors frequently face unintentional, realtime information leakage. Furthermore, we design an effective frequency-enhanced lightweight module consisting of frequency-domain attention fusion and adaptive spectral gating mechanism that breaks the limitations of spatial pixel intensity to better capture the subtle details of sensitive information. Extensive experiments conducted on both diverse image and streaming videos benchmarks consistently demonstrate the effectiveness of our VPD-100K dataset and the wellcurated frequency mechanism. The code and dataset are available at https://vpd-100k.github.io/.
comment: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
☆ Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation
Terrain segmentation is a fundamental capability for autonomous mobile robots operating in unstructured outdoor environments. However, state-of-the-art models are incompatible with the memory and compute constraints typical of microcontrollers, limiting scalable deployment in small robotics platforms. To address this gap, we develop a complete framework for robust binary terrain segmentation on a low-cost microcontroller. At the core of our approach we design Nano-U, a highly compact binary segmentation network with a few thousand parameters. To compensate for the network's minimal capacity, we train Nano-U via Quantization-Aware Distillation (QAD), combining knowledge distillation and quantization-aware training. This allows the final quantized model to achieve excellent results on the Botanic Garden dataset and to perform very well on TinyAgri, a custom agricultural field dataset with more challenging scenes. We deploy the quantized Nano-U on a commodity microcontroller by extending MicroFlow, a compiler-based inference engine for TinyML implemented in Rust. By eliminating interpreter overhead and dynamic memory allocation, the quantized model executes on an ESP32-S3 with a minimal memory footprint and low latency. This compiler-based execution demonstrates a viable and energy-efficient solution for perception on low-cost robotic platforms.
comment: Code repository: https://github.com/federico-pizz/Nano-U
☆ 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects CVPR 2026
Accurate 3D reconstruction of objects with reflective, transparent, or low-texture surfaces still remains notoriously challenging. Such materials often violate key assumptions in multi-view reconstruction pipelines, such as photometric consistency and the availability on distinct geometric texture cues. Existing datasets primarily focus on diffuse, textured objects, and therefore provide limited insight into performance under real-world material complexities. We introduce 3DReflecNet, a large-scale hybrid dataset exceeding 22 TB that is specifically designed to benchmark and advance 3D vision methods for these challenging materials. 3DReflecNet combines two types of data: over 120,000 synthetic instances generated via physically-based rendering of more than 12,000 shapes, and over 1,000 real-world objects captured using consumer devices. Together, these data consist of more than 7 million multi-view frames. The dataset spans diverse materials, complex lighting conditions, and a wide range of geometric forms, including shapes generated from both real and LLM-synthesized 2D images using diffusion-based pipelines. To support robust evaluation, we design benchmarks for five core tasks: image matching, structure-from-motion, novel view synthesis, reflection removal, and relighting. Extensive experiments demonstrate that state-of-the-art methods struggle to maintain accuracy across these settings, highlighting the need for more resilient 3D vision models.
comment: This paper has been accepted by CVPR 2026 Oral
☆ DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer CVPR 2026
Open-vocabulary object detection (OVOD) aims to detect both seen and unseen categories, yet existing methods often struggle to generalize to novel objects due to limited integration of global and local contextual cues. We propose DetRefiner, a simple yet effective plug-and-play framework that learns to fuse global and local features to refine open-vocabulary detection. DetRefiner processes global image features and patch-level image features from foundational models (e.g., DINOv3) through a lightweight Transformer encoder. The encoder produces a class vector capturing image-level attributes and patch vectors representing local region attributes, from which attribute reliability is inferred to recalibrate the base model's confidence. Notably, DetRefiner is trained independently of the base OVOD model, requiring neither access to its internal features nor retraining. At inference, it operates solely on the base detector's predictions, producing auxiliary calibration scores that are merged with the base detector's scores to yield the final refined confidence. Despite this simplicity, DetRefiner consistently enhances multiple OVOD models across COCO, LVIS, ODinW13, and Pascal VOC, achieving gains of up to +10.1 AP on novel categories. These results highlight that learning to fuse global and local representations offers a powerful and general mechanism for advancing open-world object detection. Our codes and models are available at https://github.com/hitachi-rd-cv/detrefiner.
comment: CVPR 2026 Findings
☆ SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.
☆ DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
comment: 6 pages, 8 figures
☆ Developing a foundation model for high-resolution remote sensing data of the Netherlands
We develop a foundation model using 1.2m high resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both low- and high-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing the model to exploit the temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data, achieving competitive performance on global benchmarks while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.
comment: 9 pages, 4 figures, under review in a journal
☆ A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection IEEE
Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.
comment: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore
☆ What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
The rise of text-to-image (T2I) models has increasingly raised concerns regarding the generation of risky content, such as sexual, violent, and copyright-protected images, highlighting the need for effective safeguards within the models themselves. Although existing methods have been proposed to eliminate risky concepts from T2I models, they are primarily developed for earlier U-Net architectures, leaving the state-of-the-art Diffusion-Transformer-based T2I models inadequately protected. This gap stems from a fundamental architectural shift: Diffusion Transformers (DiTs) entangle semantic injection and visual synthesis via joint attention, which makes it difficult to isolate and erase risky content within the generation. To bridge this gap, we investigate how semantic concepts are represented in DiTs and discover that attention heads exhibit concept-specific sensitivity. This property enables both the detection and suppression of risky content. Building on this discovery, we propose AHV-D\&S, a training-free inference-time safeguard for image generation in DiTs. Specifically, AHV-D\&S quantifies each textual token's sensitivity across all attention heads as an Attention Head Vector (AHV), which serves as a discriminative signature for detecting risky generation tendencies. In the inference stage, we propose a momentum-based strategy to dynamically track token-wise AHVs across denoising steps, and a sensitivity-guided adaptive suppression strategy that suppresses the attention weights of identified risky tokens based on head-specific risk scores. Extensive experiments demonstrate that AHV-D\&S effectively suppresses sexual, copyrighted-style, and various harmful content while preserving visual quality, and further exhibits strong robustness against adversarial prompts and transferability across different DiT-based T2I models.
☆ MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning
Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
☆ BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry SP
Through-water photogrammetry based on UAV imagery enables shallow-water bathymetry, but refraction at the air-water interface violates the straight-ray assumption of Structure-from-Motion and causes systematic depth bias. We present BathyFacto, a refraction-aware two-media extension of Nerfacto integrated into Nerfstudio that targets metrically precise underwater point clouds. BathyFacto uses a shared hash-grid-based density field with a medium-conditioned color head that receives a one-bit medium flag (air or water) and traces each camera ray as two segments: a straight segment in air up to a planar water surface and a refracted segment in water computed via Snell's law with known refractive indices. To allocate samples efficiently across the air-water boundary, we employ a single proposal-network sampler that operates on a virtual straight ray spanning both media, combined with a kinked density wrapper that transparently corrects water-segment positions along the refracted direction before density evaluation. A data adaptation pipeline converts photogrammetric reconstructions to a Nerfstudio-compatible format, estimates the water plane from boundary markers, and provides per-pixel medium masks to gate refraction. We also extend the point cloud export with refraction-corrected backprojection and reversible coordinate transforms to world and global frames. On a simulated two-media scene with known ground truth, BathyFacto with refraction achieves a Cloud-to-Mesh mean distance of 0.06 m and 87 % completeness, compared to 0.52 m / 29 % for the Nerfacto baseline and 0.36 m / 21% for conventional MVS without refraction correction.
comment: 16 pages, 8 figures, 3 tables. Submitted to ISPRS Open Journal of Photogrammetry and Remote Sensing, Special Issue "3D Underwater Mapping from Above and Below"
☆ V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
☆ Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation IEEE
Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
comment: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore
☆ Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images
Reducing the annotation cost of oriented object detection in remote sensing remains a major challenge. Recently, sparse annotation has gained attention for effectively reducing annotation redundancy in densely remote sensing scenes. However, (1) the sparse data reliance on class-dependent sampling, and (2) the lack of in-depth investigation into the characteristics of sparse samples hinders its further development. This paper proposes an active learning-based sparsely annotated oriented object detection (SAOOD) method, termed Active-SAOOD. Based on a model state observation module, Active-SAOOD actively selects the most valuable sparse samples at the instance level that are best suited to the current model state, by jointly considering orientation, classification, and localization uncertainty, as well as inter- and intra-class diversity. This design enables SAOOD to operate stably under completely randomly initialized sparse annotations and extends its applicability to broader real-world. Experiments on multiple datasets demonstrate that Active-SAOOD significantly improves both performance and stability of existing SAOOD methods under various random sparse annotation. In particular, with only 1\% annotated ratios, it achieves a 9\% performance gain over the baseline, further enhancing the practical value of SAOOD in remote sensing. The code will be public.
☆ MolSight: Molecular Property Prediction with Images
Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and $2\,M$ molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and top two on $\textbf{all 10}$, at $\textbf{$\textit{80$\times$ lower}$}$ FLOPs than the nearest multi-modal competitor.
☆ Improving Temporal Action Segmentation via Constraint-Aware Decoding ICPR 2026
Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing limiting scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
comment: accepted to ICPR 2026
☆ MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers
The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model is designed based on reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces the Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy up to 0.5% compared to its predecessor and surpassing MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.
☆ Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
comment: 28 pages, 8 figures, 8 tables
☆ Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection CVPR 26
Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.
comment: Accepted at CVPR 26
☆ Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Recent research work on fashion outfit generation focuses on promoting visual consistency of garments by leveraging key information from reference image and text prompt. However, the potential of outfit generation remains underexplored, requiring comprehensive e-commercial dataset and elaborative utilization of multi-modal condition. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, with various occasions, models, and garment types. For the consistent generation of garment, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on unified embeddings, the attention in generation model is redesigned to emphasis the correlations between prompts and noise image, inducing that the noise image can select the pivotal tokens of prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmark demonstrate the effectiveness of UMC in visual consistency, achieving promising result than that of SoTA methods.
☆ MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image--caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
comment: 29 pages, 14 figures
☆ Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving
Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
☆ CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients
Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at https://github.com/wxk1224/CFSPMNet.
☆ ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.
☆ HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation
We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
☆ Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction
This paper describes an initiation of interaction(IoI) detection framework without keywords for human-robot interaction(HRI) based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors, and can employ external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot can localize his or her position by its sound source localization together with human tracking information. Then the robot can detect the IoI if it perceives the face of the speaker faces the robot. In case that the user does not speak directly, the robot can also detect the IoI if he or she looks at the robot for more than predefined periods of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. In order to implement and associate our model in a robot architecture, all the components are implemented and integrated in the Robot Operating System(ROS) environment.
☆ SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
☆ MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization
The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the requirement of generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using image modality, other modalities like fine-grained texts are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GAN, but struggle to identify and localize those synthesized by diffusion. To solve the problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that studies general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art on different settings like cross-generator, cross-forgery, and cross-dataset evaluations.
☆ Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging IEEE
Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics-annotation coverage and saliency precision-are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis yields consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.
comment: Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)
☆ EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
comment: 9 pages
☆ PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows
Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.
comment: 26 pages, 7 figures
☆ ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
comment: 10 pages, 7 figures
☆ Only Train Once: Uncertainty-Aware One-Class Learning for Face Authenticity Detection
The rapid evolution of generative paradigms has enabled the creation of highly realistic imagery, which escalating the risks of identity fraud and the dissemination of disinformation. Most existing approaches frame face forgery detection as a fully supervised binary classification problem. Consequently, these models typically exhibit significant performance decay when tasked with detecting forgeries from previously unseen generative paradigms. Furthermore, these methods focus exclusively on either DeepFakes or fully synthesized faces, thereby failing to provide a generalized framework for universal face forgery detection. In this paper, we address this challenge by introducing FADNet (Face Authenticity Detector Net), % a self-supervised framework that which reformulates face forgery detection as a one-class classification (OCC) task. By training exclusively on authentic facial data to capture their intrinsic representations, FADNet flags any image whose feature embedding deviates significantly from the learned distribution of real faces as a forgery. The framework incorporates Evidential Deep Learning (EDL) to quantify predictive uncertainty and utilizes a plug-and-play pseudo-forgery image generator (PFIG) to tighten decision boundaries around authentic data. Extensive experimental evaluations on the DF40 and ASFD benchmarks demonstrate that FADNet achieves superior performance and generalization capabilities. Specifically, FADNet substantially outperforms existing state-of-the-art (SOTA) methods, yielding a remarkable average accuracy of 96.63\% and an average precision of 98.83\%.
comment: 12pages,7figures
☆ Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities
Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.
☆ MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving
With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.
☆ Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation ICLR2026
Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision--language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($ΔS$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
comment: Accepted by ICLR2026
☆ Measurement-Adapted Eigentask Representations for Photon-Limited Optical Readout
Optical readout in low-light imaging is fundamentally limited by measurement noise, including photon shot noise, detector noise, and quantization error. In this regime, downstream inference depends not only on the optical front end, but also on how noisy high-dimensional sensor measurements are represented before classification or decision-making. Here we show that eigentasks provide a measurement-adapted representation for optical sensor outputs by ordering readout features according to their resolvability under noise. Using experimental data from a lens-based optical imaging system and a reanalysis of published data from a single-photon-detection neural network, we find that eigentask representations frequently outperform standard baselines including principal component analysis and filtering-based compression. The advantage is most pronounced in photon-limited, few-shot, and higher-difficulty classification regimes. In few-shot MPEG-7 classification, for example, the advantage over other methods reaches about 10 percentage points as the number of classes increases. In these settings, eigentasks yield more informative low-dimensional features and improve sample-efficient downstream learning. These results identify measurement-adapted representation as a promising strategy for optical inference when photon budget, acquisition time, and task complexity are constrained.
comment: 15+14 pages, 4+9 figures, 55 references
☆ Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models IJCAI
Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.
comment: Accepted at IJCAI-ECAI 2026
☆ Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language, with unified omnimodal benchmarking that jointly covers text, image, and audio still limited, and lacking the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. On our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
comment: Project Page: https://github.com/oyt9306/Omni-Persona
☆ StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.
☆ Geometric 4D Stitching for Grounded 4D Generation
Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce the grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports interative expansion of 4D mesh as well as 4D scene editing.
☆ ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experiment results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85\%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
comment: 20 pages, 8 figures
☆ INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI
Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: https://anonymous.4open.science/r/INFANiTE-5D74
☆ OZ-TAL: Online Zero-Shot Temporal Action Localization
Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.
☆ HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving
End-to-end autonomous driving has witnessed rapid progress, yet existing benchmarks are increasingly saturated, with state-of-the-art models achieving near-perfect scores on widely used open-loop and closed-loop benchmarks. This saturation does not mean that the problem has been solved; instead, it reveals that current benchmarks remain limited in scenario diversity, object variety, and the breadth of driving capabilities they evaluate. In particular, they lack sufficient long-tail scenarios involving rare but safety-critical objects and fail to assess advanced decision-making such as legal compliance, ethical reasoning, and emergency response. To address these gaps, we propose HiDrive, a new closed-loop benchmark for end-to-end autonomous driving that emphasizes long-tail scenarios and a richer evaluation of driving capabilities. HiDrive introduces a diverse set of rare objects and uncommon traffic situations, and expands evaluation from basic driving skills to more advanced capabilities, including rule compliance, moral reasoning, and context-dependent emergency maneuvers. Correspondingly, we extend previous collision-avoidance-centered metrics into a comprehensive evaluation system that encompasses collision and braking, traffic-rule compliance, and moral-reasoning indicators. Built on a more advanced physics engine, HiDrive provides physically realistic lighting and high-fidelity visual rendering, offering a more challenging and realistic testbed for assessing whether autonomous driving systems can handle the complexity of real-world deployment. The HiDrive software, source code, digital assets, and documentation are available at https://github.com/VDIGPKU/HiDrive.
☆ Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field,and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.
comment: 51 pages, 7 figures, github: https://github.com/iTheresaApocalypse/Awesome-LFMs-Play-Games
☆ Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
☆ SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis
High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.
comment: 5 pages, 4 figures, 4 tables
☆ JODA: Composable Joint Dynamics for Articulated Objects
Articulated objects used in simulation and embodied AI are typically specified by geometry and kinematic structure, but lack the fine-grained dynamical effects that govern realistic mechanical behavior, such as frictional holding, detents, soft closing, and snap latching. Existing approaches either ignore the detailed structure of dynamics entirely, or use simple models with limited expressiveness. We introduce JODA, a framework for generating joint-level dynamics as a structured three-channel field over the joint degree of freedom, capturing conservative forces, dry friction, and damping. Instantiated using shape-constrained piecewise cubic interpolation (PCHIP), this formulation defines a compact and expressive function space that is both interpretable and compatible with differentiable simulation. Building on this representation, we develop methods for inferring and refining joint dynamics from multimodal inputs. Given visual observations and joint context, a vision-language model proposes structured dynamical primitives, which are composed into a unified dynamics field. The resulting representation supports both direct manipulation and gradient-based refinement. We demonstrate that JODA enables plausible and controllable modeling of diverse joint behaviors, providing a unified interface for inference, editing, and optimization. Code and example assets with their generated profiles will be released upon publication.
☆ LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
Current Vision-Language-Action (VLA) models typically treat the deepest representation of a vision-language backbone as universally optimal for action prediction. However, robotic manipulation is composed of many frequent closed-loop spatial adjustments, for which excessive abstraction may waste computation and weaken low-level geometric cues essential for precise control. Existing early-exit strategies attempt to reduce computation by stopping at predefined layers or applying heuristic rules such as action consistency, but they do not directly answer when a representation is actually sufficient for action. In this paper, we present LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and sufficiency estimation. LoopVLA iteratively applies a shared Transformer block to refine multimodal tokens, and at each iteration produces both a candidate action and a sufficiency score that estimates whether further refinement is necessary. By sharing parameters across iterations, LoopVLA decouples refinement from absolute layer indices and grounds sufficiency estimation in the evolving representation itself. Since sufficiency has no direct supervision, we introduce a self-supervised distribution alignment objective, where intermediate confidence scores are trained to match the relative action quality across refinement steps, thereby linking sufficiency learning to policy optimization signals. Experiments on LIBERO, LIBERO-Plus, and VLA-Arena show that LoopVLA pushes the efficiency-performance frontier of VLA policies, reducing parameters by 45% and improving inference throughput by up to 1.7 times while matching or outperforming strong baselines in task success.
☆ Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
☆ Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning
With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model's uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15\% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: https://github.com/hzx111621/EMSFD.
comment: 11pages,6figures
☆ Frequency Adapter with SAM for Generalized Medical Image Segmentation
Medical image segmentation is a critical task in computer-aided diagnosis and treatment planning. However, deep learning models often struggle to generalize across datasets due to domain shifts arising from variations in imaging protocols, scanner types, and patient populations. Traditional domain generalization (DG) methods utilize causal feature learning, adversarial consistency, and style augmentation to improve segmentation robustness. While effective, these approaches rely on explicit feature alignment, adversarial objectives, or handcrafted augmentations, which may not fully exploit the capabilities of foundation models. Recently, the Segment Anything Model (SAM) has demonstrated strong generalization capabilities in segmentation tasks. SAM-based DG methods attempt to improve medical image segmentation. However, these approaches primarily operate in the spatial domain and overlook frequency-based discrepancies that significantly affect model robustness. In this work, we propose Frequency-based Domain Generalization with SAM (FSAM), a novel framework that integrates Low-Rank Adaptation (LoRA) for efficient fine-tuning and a frequency adapter to incorporate frequency-domain representations for single-source domain generalization. FSAM enhances SAM's segmentation robustness by extracting domain-invariant high-frequency features, mitigating frequency-related domain shifts. Experimental results on fundus and prostate datasets demonstrate that FSAM outperforms existing traditional DG and SAM-based DG approaches in domain generalization. Codes and pre-trained models will be made available on GitHub.
comment: Under review, 10 pages, 1 figure, 2 tables
☆ TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
Video large language models (Video-LLMs) have achieved remarkable progress in general video understanding, yet their ability to maintain temporal object consistency remains insufficiently explored. Existing benchmarks primarily focus on event recognition, action understanding, or coarse temporal reasoning, but rarely evaluate whether a model can consistently preserve the identity, state, and temporal continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. As a result, current evaluations may overestimate temporal reasoning ability while overlooking failures in object-centric temporal coherence. To address this issue, we introduce TOC-Bench, a diagnostic benchmark specifically designed to evaluate temporal object consistency in Video-LLMs. TOC-Bench is explicitly object-track grounded, where each queried subject is associated with a per frame object trajectory and structured temporal event timeline. To ensure that benchmark items depend on temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we propose a three-layer temporal-necessity filtering protocol that removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items spanning 10 diagnostic dimensions. From this filtered pool, we further construct a human-verified benchmark containing 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge. Current models exhibit substantial weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, despite strong performance on general video understanding benchmarks.
☆ Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment
Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) recognize a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, leading to their limited transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.
☆ The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.
comment: 41 pages, 18 figures
☆ Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection IEEE
Cross-modal knowledge distillation has emerged as an effective strategy for integrating point cloud and image features in 3D perception tasks. However, the modality heterogeneity, spatial misalignment, and the representation crisis of multiple modalities often limit the efficient of these cross-modal distillation methods. To address these limitations in existing approaches, we propose a hyperbolic constrained cross-modal distillation method for multimodal 3D object detection (HGC-Det). The proposed HGC-Det framework includes an image branch and a point cloud branch to extract semantic features from two different modalities. The point cloud branch comprises three core components: a 2D semantic-guided voxel optimization component (SGVO), a hyperbolic geometry constrained cross-modal feature transfer component (HFT), and a feature aggregation-based geometry optimization component (FAGO). Specifically, the SGVO component adaptively refines the spatial representation of the 3D branch by leveraging semantic cues from the image branch, thereby mitigating the issue of inadequate representation fusion. The HFT component exploits the intrinsic geometric properties of hyperbolic space to alleviate semantic loss during the fusion of high-dimensional image features and low-dimensional point cloud features. Finally, the FAGO compensates for potential spatial feature degradation introduced by the 2D semantic-guided voxel optimization component. Extensive experiments on indoor datasets (SUN RGB-D, ARKitScenes) and outdoor datasets (KITTI, nuScenes) demonstrate that our method achieves a better trade-off between detection accuracy and computational cost.
comment: Current version has been subbmitted to IEEE Transactions on Multimedia. Now, this manuscript's status is Under Review
☆ The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the \textbf{Cartesian Shortcut}: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce \textbf{Polaris-Bench}, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across $14$ state-of-the-art MLLMs reveals that frontier models achieving $70$--$83\%$ on Cartesian layouts collapse to $31$--$39\%$ on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
☆ EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.
comment: The first two authors contributed equally. Project website: https://egomemreason.github.io/
☆ ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control
Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image--text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
comment: 13 pages, 5 figures
☆ DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment IEEE
Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine-grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA-SegFormer, a damage-aware adaptation of the SegFormer architecture optimized for high-resolution disaster imagery. Our method introduces a Class-Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution-preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA-SegFormer achieves 74.61\% mIoU, outperforming the baseline by 2.55\%. Notably, our improvements yield double-digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
comment: Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
☆ Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.
☆ Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking IEEE
Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.
comment: Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland
☆ MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery
Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter due to the insufficient spatial features for occluded body parts. Inspired by the rapid advancements in human motion prediction, we discover that compared to occluded image features, pose sequence inherently contains reliable motion prior for estimating occluded body parts. In this paper, we incorporate Motion Prior for Occluded human mesh recovery, called MoPO. Our MoPO mainly consists of two components: 1) The motion de-occlusion module, where we propose a spatial-temporal occlusion detector to detect joint visibility, and then we propose a lightweight motion predictor to complete the occluded body parts by predicting the most plausible joint positions based on history poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides the occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.
comment: 35 pages
☆ Probing Routing-Conditional Calibration in Attention-Residual Transformers
Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only $1$ of $30$ within-bin permutation tests rejects the conditional-null at $α=0.05$ (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal $2$-D Nadaraya--Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over $(c, H_1, \ldots, H_L)$ can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.
comment: Under reviewing
☆ Loom: Hybrid Retrieval-Scoring Outfit Recommendation with Semantic Material Compatibility and Occasion-Aware Embedding Priors
We present Loom, an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coherent outfits from fashion catalogs. Given an anchor clothing item, Loom retrieves complementary pieces via slot-constrained approximate nearest neighbor search over FashionCLIP embeddings, then scores candidate outfits using a multi-objective function that integrates six signals: embedding similarity, color harmony, formality consistency, occasion coherence, style direction, and within-outfit diversity. We introduce two techniques that address limitations of purely learned or purely rule-based approaches: (1) semantic material weight, which uses CLIP embedding geometry to infer garment heaviness for layer compatibility without hand-coded material taxonomies; and (2) vibe/anti-vibe occasion priors, which embed prose descriptions of occasion contexts as anchor vectors in CLIP space and score items by differential affinity. Ablation experiments on a catalog of 620 items show that each component contributes measurably to outfit quality: the full system achieves a mean outfit score of 0.179 with a 9.3% hard violation rate, compared to 0.054 score and 16.0% violations for a category-constrained random baseline, a 3.3x improvement in score and 42% reduction in violations. Direction reranking is the single indispensable component: removing it drops score to 0.052, essentially equal to random. The system generates three stylistically distinct outfits in under 5 seconds on commodity hardware.
comment: Code: https://github.com/anushreeberlia/loom
☆ Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction
We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
comment: Model: https://huggingface.co/anushreeberlia/fashion-florence
☆ Quantifying Rodda and Graham Gait Classification from 3D Makerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort
Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
comment: 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format
☆ Couple to Control: Joint Initial Noise Design in Diffusion Models
Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
comment: 26 pages
☆ Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.
comment: Project page: https://image2code.github.io/vision2code/
☆ CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.
☆ LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
☆ Can Graphs Help Vision SSMs See Better?
Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: \emph{can graphs help vision state space models see better?} We introduce \textbf{GraphScan}, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, \textbf{GraphScan-Mamba} achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
comment: Technical Report
☆ Generative AI for Visualizing Highway Construction Hazards Through Synthetic Images and Temporal Sequences
Highway construction workers face a high risk of serious injury or death. Image-based training materials depicting hazardous scenarios are essential for engaging safety instruction but remain scarce due to ethical and logistical barriers. This study develops and evaluates a generative AI methodology for producing synthetic visualizations of highway construction hazards from OSHA Severe Injury Report narratives. Two modes were developed: a single-pass approach yielding one image per incident, and a temporal approach producing a four-stage sequence. A sample of 75 incident records yielded 750 images, evaluated using CLIP-based semantic retrieval and expert assessment across dimensions such as educational utility, fidelity, and alignment. Single-pass images achieved 81.1% educational acceptability with fidelity and alignment scores of 4.14/5 and 4.07/5, respectively, while temporal sequences achieved 60.9% acceptability with comparable alignment (3.94/5) but lower fidelity (3.51/5). CLIP-based retrieval revealed that both modes produce images with statistically significant retrieval capabilities. This is among the first studies to leverage modern autoregressive image generation models for visualizing construction hazards from reported severe injuries and to generate temporally sequenced hazard imagery, and a new multi-dimensional evaluation framework was developed to support future research in this domain. The work enables safety trainers to pair narrative storytelling with visual learning material without photographing real-world hazards, and the framework could be applied to datasets across diverse domains, enabling synthetic image generation tailored to new application areas.
☆ Real-Scale Island Area and Coastline Estimation using Only its Place Name or Coordinates IEEE
Accurate measurement of island area and coastline length is crucial for coastal zone monitoring and oceanographic analysis. However, traditional measurement and mapping methods usually rely heavily on orthophotos, expensive airborne depth sensors, or dense ground control points, which face serious limitations of high labor costs, time-consuming efforts, and low operational efficiency in vast and inaccessible open sea environments. To overcome these challenges and break away from the reliance on manual field exploration, this paper proposes a geometrically consistent, real-scale island measurement framework based on pure monocular vision. This project significantly reduces the mapping cost through a fully automated process and achieves high-efficiency measurement without prior GIS data. In our system pipeline, only the geographical coordinates or names of the target area need to be input to obtain a low-altitude surrounding image sequence. After obtaining the point clouds, a lightweight trajectory alignment algorithm (Umeyama) is used to restore the global physical scale, and the scaled model is orthorectified, enabling high-precision area and perimeter extraction directly on the 2D rasterized plane. We have fully verified this pipeline on four islands with different terrain features (covering natural landform islands and islands with complex artificial facilities). The experimental results show that the final measurement error of the system is stable at around 10\%, demonstrating excellent accuracy and robustness. Moreover, this framework has outstanding inference speed, requiring only 70 ms to process a single high-resolution image and generate point clouds, providing a highly practical new paradigm for large-scale marine and coastline
comment: Accepted for publication at IEEE OCEANS (Sanya) 2026
☆ PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
Recent advances in Gaussian Splatting have enabled fast, high-fidelity 3D scene generation, yet these methods remain purely visual and lack an understanding of how shapes behave in the physical world. We introduce Physics-Guided 3D Gaussian Splatting (PG-3DGS), a framework that couples differentiable physics simulation with 3D Gaussian representations to generate 3D structures satisfying physics functionalities. By allowing physical objectives to guide the shape optimization process alongside visual losses, our approach produces geometries that are not only photometrically accurate but also physically functional. The model learns to adjust shapes so that the generated objects exhibit physically meaningful behaviors, for example, teapots that can pour and airplanes that can generate lift, without sacrificing visual quality. Experiments on pouring and aerodynamic lift tasks show that PG-3DGS improves physical functionality while preserving visual quality. In addition to simulation gains, bench-top physical lift tests with 3D-printed aircraft (Cessna, B-2 Spirit, and paper plane) under identical airflow conditions show higher scale-measured lift for PG-3DGS, generated structures than an appearance-matching baseline in all three cases. Our unified framework connects appearance-based reconstruction with physics-based reasoning, enabling end-to-end generation of 3D structures that both look realistic and function correctly.
comment: Submitted to Artificial Intelligence. 52 pages
☆ DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction MICCAI 2026
Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.
comment: Accepted to 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026)
☆ ABRA: Agent Benchmark for Radiology Applications
Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0-25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69-100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA
☆ Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
comment: 11 pages, 2 figures
☆ FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
Intermediate feature representations represent the backbone for the expressivity and adaptability of deep neural networks. However, their geometric structure remains poorly understood. In this submission, we provide indirect insights into this matter by applying a broad selection of manipulations in input space, ranging from geometric and photometric transformations to local masking and semantic manipulations using generative image editing models, and assess the feasibility of learning a mapping in the feature space, mapping from the original to the manipulated feature map. To this end, we devise different types of mappings, from linear to non-linear and local to global mappings and assess both the reconstruction quality of the mapping as well as the semantic content of the mapped representations. We demonstrate the feasibility of learning such mappings for all considered transformations. While global (transformer) models that operate on the full feature map often achieve best results, we show that the same can be achieved with a shared linear model operating on a single feature vector typically with very little degradation in reconstruction quality, even for highly non-trivial semantic manipulations. We analyze the corresponding mappings across different feature layers and characterize them according to dominance of weight vs. bias and the effective rank of the linear transformations. These results provide hints for the hypothesis that the feature space is to a first degree of approximation organized in linear structures. From a broader perspective, the study demonstrates that generative image editing models might open the door to a deeper understanding of the feature space through input manipulation.
comment: 27 pages, 24 figures, 3 tables, Code is available at https://github.com/AI4HealthUOL/FeatMap
☆ Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives
Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.
☆ USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation
Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.
☆ LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusionbased approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both textand image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.
☆ Deploying Self-Supervised Learning for Real Seismic Data Denoising
Self-supervised learning (SSL) has emerged as a promising approach to seismic data denoising as it does not require clean reference data. In this work, the deployment of the Noisy-as-Clean (NaC) method was evaluated for real seismic data denoising under controlled conditions. Two independent seismic acquisitions, each comprising noisy and filtered data, were organized into four real datasets. The NaC SSL method was adapted to add real noise to the noisy input, controlled by a parameter. An experimental protocol with ten experiments was designed to compare different strategies for deploying the NaC SSL method with the supervised learning baseline, using identical network topology and hyperparameters. The models were evaluated in terms of denoising performance, computational cost, and generalization capability. The results show that the synthetic additive white Gaussian noise (AWGN) is inadequate for the denoising of seismic data within the NaC method, and performance strongly depends on the compatibility between the injected and actual noise characteristics. Furthermore, both the characteristics of the seismic data and the noise level influence the performance of the model. Self-supervised fine-tuning on test data has improved SSL performance, whereas no such gain was observed for fine-tuning of supervised models. Finally, NaC has shown to be a simple, effective, and model-independent method that offers a feasible solution for the denoising of real seismic data.
☆ Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
comment: 36 pages, 7 figures
☆ HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within an Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
comment: Source codes and models are available at Github: https://github.com/HiDream-ai/HiDream-O1-Image and Huggingface: https://huggingface.co/HiDream-ai/HiDream-O1-Image
☆ SplitFed-CL: A Split Federated Co-Learning Framework for Medical Image Segmentation with Inaccurate Labels
Split Federated Learning (SplitFed) combines federated and split learning to preserve privacy while reducing client-side computation. However, in medical image segmentation, heterogeneous label quality across clients can significantly degrade performance. We propose SplitFed-CL, a co-learning framework where a global teacher guides local students to detect and refine unreliable annotations. Reliable labels supervise training directly, while unreliable labels are corrected via weighted student--teacher refinement. SplitFed-CL further incorporates consistency regularization for robustness to input perturbations and a trainable weighting module to balance loss terms adaptively. We also introduce a novel difficulty guided strategy to simulate human like boundary centric annotation errors, where the degree of perturbation is governed by shape complexity and the associated annotation difficulty. Experiments on two multiclass segmentation datasets with controlled synthetic noise, together with a binary segmentation dataset containing real-world annotation errors, demonstrate that SplitFed-CL consistently outperforms seven state-of-the-art baselines, yielding improved segmentation quality and robustness.
♻ ☆ Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
♻ ☆ BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.
comment: 11 pages, 6 figures
♻ ☆ NoTVLA: Semantics-Preserving Robot Adaptation via Narrative Action Interfaces
Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it leverages temporal compression and spatial reasoning pruning specifically for the robot end effector's trajectory. Furthermore, training is conducted using these sparse trajectories rather than dense action trajectories, an optimization that delivers remarkable practical advantages with better performance in zero-shot. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to pi0 while operating under two critical constraints: it uses over an order of magnitude less computing power than pi0 and requires no wrist-mounted camera. This design ensures that NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.
♻ ☆ ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking
Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms heavily rely on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level, requiring models to infer and track specific targets satisfying implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large scale dataset. This dataset comprises 1,156 language instructions, 423,359 image language pairs, and 869 distinct video sequences systematically categorized into six distinct evaluation scenarios, with over 75\% of the instructions dedicated to High Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while direct application of Large Vision-Language Model (LVLM) yields severe temporal inconsistencies, we propose ReaTrack. Driven by the insight to decouple high-level cognitive localization from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack establishes a new leading performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.
comment: Code: https://github.com/chen-si-jia/ReaMOT
♻ ☆ When Large Vision-Language Models Meet Person Re-Identification ICASSP 2026
Large Vision-Language Models (LVLMs) that incorporate visual models and large language models have achieved impressive results across cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the representation of pedestrian identity. Our framework integrates the semantic understanding and generation capabilities of LVLM into end-to-end ReID training, allowing LVLM to capture rich semantic cues during both training and inference. LVLM-ReID achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID.
comment: Accepted by ICASSP 2026
♻ ☆ Beyond Nearest Neighbor Interpolation in Data Augmentation
Avoiding the risk of undefined categorical labels using nearest neighbor interpolation overlooks the risk of exacerbating pixel level annotation errors in augmented training data. Additionally, the inherent low pass filtering effects of interpolation algorithms exacerbate the risk of degrading high frequency structural details within annotated regions of interest. To avoid these risks, the author modified convolutional neural networks data transformation functions by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation specific augmented training data, enabling quantitative assessment of interpolation specific low pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT datasets demonstrated performance gains across multiple quantitative metrics.
comment: 10 pages, 11 figures, 14 tables
♻ ☆ High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
comment: 19 Pages,11 figures,8 tables
♻ ☆ Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation
Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
♻ ☆ Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE IEEE
Approval of ADS depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches, they lack interpretability and may not align with domain-knowledge. This work contributes to a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain-knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.
comment: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
♻ ☆ SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation SP
We present a visual-context image-retrieval-augmented generation (ImageRAG)- assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR) imagery. SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples of known true target types, our SAR-RAG system can compare similar vehicle categories, thereby improving ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
comment: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI
♻ ☆ Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection ICRA 2026
Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation,grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, crossmodality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-tolocal hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.
comment: Accepted by ICRA 2026
♻ ☆ Efficient Dataset Distillation for Pre-Trained Self-Supervised Models via Statistical Flow Matching
Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For the classification task that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching , a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
♻ ☆ UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes CVPR 2026
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.
comment: Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg ; Accepted by CVPR 2026
♻ ☆ KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes
Facial animation is a core component for creating digital characters in Computer Graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.
♻ ☆ GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of ``jailbreaks'' to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.
comment: 56 pages
♻ ☆ One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $π_0$).
♻ ☆ AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies
Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as it informs mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.
comment: 10 pages, 6 figures, conference
♻ ☆ NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification
Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under various challenges in real world. Benefiting from the powerful Multi-modal Large Language Models (MLLMs), the object appearances are effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework, named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarsegrained structure features. For semantic recognition, we first propose a Text-Modulated Semantic Experts (TMSE), which randomly samples high-quality captions to modulate experts capturing semantic features and mining inter-modality complementary cues. Second, to recognize structure features, we propose a Context-Shared Structure Experts (CSSE), which focuses on the holistic object structure and maintains identity structural consistency via a soft routing mechanism. Finally, we propose a Multi-Grained Features Aggregation (MGFA), which adopts a unified fusion strategy to effectively integrate multi-grained expert features into the final identity representations. Extensive experiments on two public person datasets and three vehicle datasets demonstrate the effectiveness of our method, showing that it significantly outperforms existing state-of-the-art methods.
♻ ☆ UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.
♻ ☆ PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
comment: 23 pages, 12 figures, including appendices
♻ ☆ Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
comment: Accepted to TMLR 2026
♻ ☆ JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
comment: Project webpage available at https://justdubit.github.io
♻ ☆ Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding ACL 2026
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
comment: Accepted to ACL 2026
♻ ☆ Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments ICML 2026
This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
comment: ICML 2026
♻ ☆ CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal ICML 2026
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by + 6.77dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.
comment: Accepted by ICML 2026 (Spotlight)
♻ ☆ H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models
By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
comment: Poster at https://sites.google.com/berkeley.edu/bb-stat/home
♻ ☆ LPT: Less-overfitting Prompt Tuning for Vision-Language Model
Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Due to its efficiency, prompt learning has gradually become a more effective and efficient method for transferring VLMs to downstream tasks, surpassing traditional finetuning methods. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.
♻ ☆ HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement
In practical applications, low-light images are often compressed for efficient storage and transmission. Most existing methods disregard compression artifacts removal or hardly establish a unified framework for joint task enhancement of low-light images with varying compression qualities. To address this problem, we propose an efficient hybrid priors-guided network (HPGN) that enhances compressed low-light images by integrating both compression and illumination priors. Our approach fully utilizes the JPEG quality factor (QF) and DCT quantization matrix (QM) to guide the design of efficient plug-and-play modules for joint tasks. Additionally, we employ a random QF generation strategy to guide model training, enabling a single model to enhance low-light images with different compression levels. Experimental results demonstrate the superiority of our proposed method.
comment: 5 pages, 3 figures
♻ ☆ COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data
Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover. Relationships between modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations, and should be parametrised as conditional distributions. Deterministic models, by contrast, collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous EO modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and adapts its output uncertainty as conditioning information increases. We release a stochastic benchmark built from multi-temporal Sentinel-2 observations that enables distribution-level comparison of generative EO models. On this benchmark, COP-GEN covers 90% of the real observation manifold and 63% of its per-band reflectance range, while the strongest competing method collapses to 2.8% and 18%, respectively. These results highlight the importance of stochastic generative modeling for EO and motivate evaluation protocols beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
♻ ☆ Deepfake Detection that Generalizes Across Benchmarks
The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of one of the foundational pre-trained vision encoders. The proposed method, GenD, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and metric learning on it. We conducted an extensive evaluation on 14 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained foundational image encoder model. The code is at: https://github.com/yermandy/GenD
♻ ☆ OpenClaw-RL: Train Any Agent Simply by Talking
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
comment: Code: https://github.com/Gen-Verse/OpenClaw-RL
♻ ☆ Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift ICLR 2026
Personalizing text-to-image diffusion models involves integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. However, this process often leads to overfitting, where the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, which are subject fidelity and text alignment, and the training objectives of existing methods that fail to enforce both objectives simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, resulting in distributional drift that undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This promotes consistency with the pretrained model's behavior while enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.
comment: Accepted at ICLR 2026
♻ ☆ ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model's internal knowledge base. While a sequential SFT $\rightarrow$ RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (\textbf{Vi}sual \textbf{Su}pervised-and-\textbf{R}einforcement \textbf{F}ine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.
♻ ☆ Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology CVPR 2026
Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9\% to 31.25\% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.
comment: CVPR 2026 Workshops
♻ ☆ Diffusion Masked Pretraining for Dynamic Point Cloud
Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference.Codes are available at https://github.com/InitalZ/DiMP.git.
♻ ☆ Exploring 6D Object Pose Estimation with Deformation CVPR 2026
We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications.
comment: Accepted at CVPR 2026
♻ ☆ APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidences. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
♻ ☆ Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework CVPR 2026
Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.
comment: Accepted by CVPR 2026
♻ ☆ EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA-a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and future-informed trajectory optimization. Specifically, self-anchored visual distillation leverages self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, future-informed trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to synthesize reasoning trajectories that model future evolutions, enabling the student model to internalize the future-aware insights of the teacher. EvoDriveVLA achieves SOTA performance in nuScenes open-loop evaluation and significantly enhances performance in NAVSIM closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
comment: 19 pages, 5 figures, 5 tables
♻ ☆ HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data
The increasing frequency and severity of climate related disasters have intensified the need for real time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large scale remote sensing datasets. However most existing models rely on high-resolution satellite imagery with low revisit rates limiting their suitability for fast-evolving phenomena and time critical emergency response. In this work, we present HighFM, a first cut approach towards a FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real time monitoring, we enhance the original architecture with fine grained temporal encodings to capture short term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
♻ ☆ R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection CVPR 2026
4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets. The source code and models will be released at https://github.com/VDIGPKU/R4Det.
comment: Accepted to CVPR 2026
♻ ☆ UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing AAAI 2026
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
comment: Accepted by AAAI 2026
♻ ☆ The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT $\unicode{x2013}$ Multitracer Multicenter Generalization
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
comment: Preprint submitted to Medical Image Analysis
♻ ☆ Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition
Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
♻ ☆ Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
♻ ☆ MultiAnimate: Pose-Guided Image Animation Made Extensible CVPR2026
Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components-Identifier Assigner and Identifier Adapter - which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
comment: CVPR2026 Accepted. Project page at https://hyc001.github.io/MultiAnimate/
♻ ☆ AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
comment: 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released
♻ ☆ Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Recent advances in generative AI have shown human-level performance in complex content creation. However, we identify a "Paradox of Simplicity": models that can render complex scenes often fail at trivial, low-entropy tasks, such as generating a uniform pure color image. We argue this is a systemic failure related to uncontrollable emergent abilities. As models scale, strong priors for aesthetics and complexity override deterministic simplicity, creating an "aesthetic bias" that hinders the model's transition from data simulation to true intellectual abstraction. To better investigate this problem, we formalize the concept of AI Obedience, a hierarchical framework that grades a model's ability to transition from probabilistic approximation to pixel-level determinism (Levels 1 to 5). We introduce Violin, the first systematic benchmark designed to evaluate Level 4 Obedience through three deterministic tasks: color purity, image masking, and geometric shape generation. Using Violin, we evaluate several state-of-the-art models and reveal that closed-source models generally outperform open-source ones in deterministic precision. Interestingly, performance on our benchmark correlates with the benchmark in natural image generation. Our work provides a foundational framework and tools for achieving better alignment between human instructions and model outputs.
♻ ☆ Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background
CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: https://github.com/lounwb/FoBoR.
comment: arXiv preprint arXiv:2601.15065 (2026)
♻ ☆ Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
comment: 8 pages, 4 figures
♻ ☆ Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media CVPR
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
comment: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026
♻ ☆ EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $Ψ$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
comment: Revision of Section 4.1
♻ ☆ VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. The VidNum-1.4K is uniquely structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while the Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%--45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.
comment: 7 pages, 5 figures, under review at ACMMM 2026 Dataset Track
♻ ☆ GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.
♻ ☆ Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
comment: 11 pages, 6 figures
♻ ☆ Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting
While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce \textbf{LeGS}, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks \& Temples, and Deep Blending datasets demonstrate that \textbf{LeGS} significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at https://github.com/AaronNZH/LeGS
comment: 9 pages, 5 figures
♻ ☆ UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video
We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from artifacts, where bodies float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. Our goal is to exploit contact between the human and the scene during inference to actively improve the human mesh reconstruction. To that end, we explicitly model interaction by inferring 4D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene. Experiments on standard human-centric video benchmarks show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while preserving fast, feed-forward inference speeds. The results validate our central claim: contact serves as a powerful internal prior, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. Project page is available at https://surtantheta.github.io/UniCon3R .
comment: Project page: https://surtantheta.github.io/UniCon3R
♻ ☆ Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness
Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Recent studies have extended backdoor attacks to 3D point clouds, but most existing triggers are sample-wise and often cause visible geometric artifacts or high optimization cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), a patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes a point cloud into local patches, where each patch is formed by a Farthest Point Sampling (FPS) center and its K-nearest neighbors (KNN). Candidate patches are ranked using a patch imperceptibility score derived from local curvature variation, and a unified spectral trigger is injected into the selected patches by perturbing only the coordinates of existing points while preserving the original point cardinality. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA achieves state-of-the-art stealthiness among prior methods and reduces spectral-trigger computation by 98.43% relative to a sample-wise spectral baseline, while maintaining competitive attack performance. These results support localized spectral design as an effective and efficient approach to stealthy backdoor attacks in 3D point cloud models. Code is available at https://github.com/HazardFY/SPBA.
comment: 12 pages, 6 figures, 11 tables
♻ ☆ Selective LoRA for Visual Tokens and Attention Heads
Low-rank adaptation (LoRA) is widely used for parameter-efficient fine-tuning, but its standard all-token, all-head design ignores the heterogeneous structure of vision language model (VLM) inputs. We introduce \emph{Image-LoRA}, a vision-oriented PEFT recipe that views LoRA as a token-level residual update and applies this update only to visual tokens. Image-LoRA further restricts adaptation to the value path of a compact subset of attention heads, selected using a one-pass influence estimate from a rank-1 visual-token-only probe. This token-, head-, and value-selective design reduces trainable parameters and adapter-only training FLOPs while leaving the pure-text forward pass of the frozen backbone unchanged when no visual tokens are present. Across visual localization benchmarks with controlled text:image token ratios, Image-LoRA matches or closely approaches standard LoRA, while showing especially favorable trade-offs in image-token-heavy regimes. We further validate its generality on TextVQA and VideoQA, verify pure-text preservation on GSM8K, and show on ViLP that a stronger information bottleneck can yield gains over standard LoRA.
♻ ☆ One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agent such as cars and pedestrians at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, which is an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose Dust (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decouple pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling enables the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make Dust practical, we further introduce a static anchor-based pose correction pipeline that corrects spatio misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, with keeping robustness under larger temporal asynchrony.
♻ ☆ 4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Guassian Splatting
Although 3D Gaussian Splatting (3D-GS) achieves efficient rendering for novel view synthesis, extending it to dynamic scenes still results in substantial memory overhead from replicating Gaussians across frames. To address this challenge, we propose 4D Neural Voxel Splatting (4D-NVS), which combines voxel-based representations with neural Gaussian splatting for efficient dynamic scene modeling. Instead of generating separate Gaussian sets per timestamp, our method employs a compact set of neural voxels with learned deformation fields to model temporal dynamics. The design greatly reduces memory consumption and accelerates training while preserving high image quality. We further introduce a novel view refinement stage that selectively improves challenging viewpoints through targeted optimization, maintaining global efficiency while enhancing rendering quality for difficult viewing angles. Experiments demonstrate that our method outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.
comment: 10 pages, 7 figures
♻ ☆ Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at https://github.com/yanzq95/HiPR.
♻ ☆ Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction SIGGRAPH 2026
Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle in faithfully generating the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and a varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements--such as rooms, windows, and doors--are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, hence allowing for effectively directing the attention mechanism to focus on informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling for efficiently handling complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
comment: Accepted to SIGGRAPH 2026. Project page: https://cornell-vailab.github.io/Raster2Seq/
♻ ☆ GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification
We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps, annotated with four disease stages, enabling effective exploitation of multimodal complementary information and facilitating accurate diagnosis and treatment across disease stages. To effectively integrate cross-modal information, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework employs hierarchical attentive encoders and light decoders to focus cross-modal representation learning on the encoder.
♻ ☆ VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token level supervision. To stably integrate dense supervision with RL, we introduce a direction magnitude decoupling mechanism, where rollout level advantages computed from rewards determine update direction, while structured privileged signals modulate token level update magnitudes. This design enables semantically aligned and fine grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs.
♻ ☆ Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression ICML 2026
Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
comment: Accepted by ICML 2026
♻ ☆ UniUncer: Unified Dynamic Static Uncertainty for End to End Driving ICRA 2026
End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only $\sim$0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7\%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8\% and notable stage two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
comment: Accepted ICRA 2026
♻ ☆ SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge
Surgical data science has seen rapid advancement with the excellent performance of end-to-end deep neural networks (DNNs). Despite their successes, DNNs have been proven susceptible to minor "corruptions," introducing a major concern for the translation of cutting-edge technology, especially in high-stakes scenarios. We introduce the SegSTRONG-C challenge dedicated to better understanding model deterioration under unforeseen but plausible non-adversarial "corruption" and the capabilities of contemporary methods that seek to improve it. Built on a dataset generated through counterfactual robotic replay, SegSTRONG-C provides paired clean and "corrupted" samples, enabling reproducible evaluation of model robustness. Participants are challenged to train tool segmentation algorithms on "uncorrupted" data and evaluate them on "corrupted" test domains for the binary robot tool segmentation task. Through comprehensive baseline experiments and participating submissions from widespread community engagement, SegSTRONG-C reveals key themes for model failure and identifies promising directions for improving robustness. The performance of challenge winners, achieving an average 0.9394 DSC and 0.9301 NSD across the unreleased test sets with "corruption" types: bleeding, smoke, and low brightness. This highlights how prior knowledge, customized training strategies, and architectural choice can be leveraged to improve robustness. In conclusion, the SegSTRONG-C challenge has identified practical approaches for enhancing model robustness. However, most approaches rely on conventional techniques that have known limitations. Looking ahead, we advocate for expanding intellectual diversity and creativity in non-adversarial robustness beyond data augmentation, calling for new paradigms that enhance universal robustness to unforeseen "corruptions" to facilitate richer applications in surgical data science.
♻ ☆ From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GD-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by $\sim$17% in mean Intersection over Union.
comment: Project Page: https://chumsy0725.github.io/GS-DIFF/
♻ ☆ Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision, but naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
♻ ☆ Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.
comment: Accepted by T-PAMI 2026. Code and data: https://github.com/IRMVLab/UniHand
♻ ☆ Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.
♻ ☆ EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction
Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.
comment: We withdraw this submission due to significant errors in the presentation and logical structure of the paper. We found that the current version does not accurately convey the research findings and requires a major overhaul of the manuscript's methodology description and results analysis
♻ ☆ Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes
Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.
comment: Project website at https://gonzalognogales.github.io/coarse2real/
♻ ☆ Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making
The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.
♻ ☆ Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework
While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point $\textit{ball convergence}$ using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.
♻ ☆ ReasonEdit: Editing Vision-Language Models using Human Reasoning
Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.
♻ ☆ Adversarial Flow Models ICML 2026
We present adversarial flow models, a class of generative models that belongs to both the adversarial and flow families. Our method supports native one-step and multi-step generation and is trained with an adversarial objective. Unlike traditional GANs, in which the generator learns an arbitrary transport map between the noise and data distributions, our generator is encouraged to learn a deterministic noise-to-data mapping. This significantly stabilizes adversarial training. Unlike consistency-based methods, our model directly learns one-step or few-step generation without having to learn the intermediate timesteps of the probability flow for propagation. This preserves model capacity and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model achieves a new best FID of 2.38. We additionally demonstrate end-to-end training of 56-layer and 112-layer models without any intermediate supervision, achieving FIDs of 2.08 and 1.94 with a single forward pass and surpassing the corresponding 28-layer 2NFE and 4NFE counterparts with equal compute and parameters. The code is available at https://github.com/ByteDance-Seed/Adversarial-Flow-Models
comment: ICML 2026
♻ ☆ DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild
In this paper, a novel dataset is introduced, designed to assess student attention within in-person classroom settings. This dataset encompasses RGB camera data, featuring multiple cameras per student to capture both posture and facial expressions, in addition to smartwatch sensor data for each individual. This dataset allows machine learning algorithms to be trained to predict attention and correlate it with emotion. A comprehensive suite of attention and emotion labels for each student is provided, generated through self-reporting as well as evaluations by four different experts. Our dataset uniquely combines facial and environmental camera data, smartwatch metrics, and includes underrepresented ethnicities in similar datasets, all within in-the-wild, in-person settings, making it the most comprehensive dataset of its kind currently available. The dataset presented offers an extensive and diverse collection of data pertaining to student interactions across different educational contexts, augmented with additional metadata from other tools. This initiative addresses existing deficiencies by offering a valuable resource for the analysis of student attention and emotion in face-to-face lessons.
♻ ☆ See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models
Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (37 C -98.6 F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.
♻ ☆ Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting
Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.
comment: The source code is available at https://github.com/zhiyichin/ICER
♻ ☆ Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The codes and weights will be released in: https://github.com/CostaliyA/Flow-OPD .
comment: Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD
♻ ☆ Interactive Mars Image Content-Based Search with Interpretable Machine Learning AAAI 38
The NASA Planetary Data System (PDS) hosts millions of images of planets, moons, and other bodies collected throughout many missions. The ever-expanding nature of data and user engagement demands an interpretable content classification system to support scientific discovery and individual curiosity. In this paper, we leverage a prototype-based architecture to enable users to understand and validate the evidence used by a classifier trained on images from the Mars Science Laboratory (MSL) Curiosity rover mission. In addition to providing explanations, we investigate the diversity and correctness of evidence used by the content-based classifier. The work presented in this paper will be deployed on the PDS Image Atlas, replacing its non-interpretable counterpart.
comment: Published at the Thirty-Sixth Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-24). Corrected citation: Proc. AAAI 38(21): 22976-22982 (2024)
♻ ☆ VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
♻ ☆ Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens-such as from videos, event camera streams, images, or point clouds-our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show how PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5\% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
Artificial Intelligence 300
☆ ELF: Embedded Language Flows
Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
comment: Tech Report. Project webpage: https://github.com/lillian039/ELF
☆ Variational Inference for Lévy Process-Driven SDEs via Neural Tilting
Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.
comment: The associated project page which contains the official implementation can be found in https://circle-group.github.io/research/NeuralTilting/
☆ Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2\% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
☆ Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.
comment: 56 pages, 21 figures, 14 tables
☆ Engineering Robustness into Personal Agents with the AI Workflow Store
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes -- iterative design, rigorous testing, adversarial evaluation, staged deployment, and more -- that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent *workflows* that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an *AI Workflow Store* that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the ``on-the-fly'' paradigm to navigate effectively.
☆ DataMaster: Towards Autonomous Data Engineering for Machine Learning
As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
☆ Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
☆ Shields to Guarantee Probabilistic Safety in MDPs
Shielding is a prominent model-based technique to ensure safety of autonomous agents. Classical shielding aims to ensure that nothing bad ever happens and comes with strong guarantees about safety and maximal permissiveness. However, shielding systems for probabilistic safety, where something bad is allowed to happen with an acceptable probability, has proven to be more intricate. This paper presents a formal framework that conservatively extends classical shields to probabilistic safety. In this framework, we (i) demonstrate the impossibility of preserving the strong guarantees on safety and permissiveness, (ii) provide natural shields with weaker guarantees, and (iii) introduce offline and online shield constructions ensuring strong safety guarantees. The empirical evaluation highlights the practical advantages of the new shields, as well as their computational feasibility.
comment: Accepted to CAV 2026
☆ LoKA: Low-precision Kernel Applications for Recommendation Models At Scale ISCA'26
Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
comment: Accepted to ISCA'26
☆ AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
comment: 22 pages
☆ CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation
Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.
☆ Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography
This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.
☆ Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.
☆ BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON ( Behavioral Engine for Authentication \& Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive \textit{Valorant} gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary \textit{Valorant} configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models
☆ BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of Multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.
comment: 9 page 7 figures
☆ The Generalized Turing Test: A Foundation for Comparing Intelligence
We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
☆ Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
comment: 15 pages, 4 figures
☆ Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
comment: 57 pages, 1 figure, 6 MultiTP moral dimensions
☆ Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
comment: 17 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA
☆ From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
☆ MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
☆ SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor's hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.
☆ The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning
As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet misleading documents degrade performance, but the quantitative relationship between the proportion of distractors and performance remains unstudied. In this work, we systematically vary the hard-distractor proportion in fixed-length contexts, revealing a striking nonlinear pattern: as the proportion of hard distractors increases, performance drops sharply within the first small fraction, while the remainder of the range yields only marginal additional decline. We term this ''The First Drop of Ink'' effect, analogous to how a single drop of ink contaminates water. Our theoretical and empirical analyses grounded in attention mechanics show that hard distractors capture disproportionate attention even at small proportions, with diminishing marginal impact as their proportion grows. Controlled experiments further show that filtering gains mainly come from context-length reduction rather than distractor removal; substantial recovery requires reducing the hard-distractor proportion to near zero, highlighting the importance of upstream retrieval precision.
☆ MaD Physics: Evaluating information seeking under constraints in physical environments
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
comment: 64 pages, 10 figures. Project page: https://mad-physics.github.io/
☆ ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
☆ CLEF: EEG Foundation Model for Learning Clinical Semantics
Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.
☆ Policy Gradient Methods for Non-Markovian Reinforcement Learning
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
comment: 39 pages, 5 figures, 1 table
☆ Probing Cross-modal Information Hubs in Audio-Visual LLMs ICML 2026
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
comment: Accepted by ICML 2026
☆ NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user's research history. A label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.
comment: 40 pages, 14 figures, 7 tables
☆ Switching-Geometry Analysis of Deflated Q-Value Iteration
This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor $γ\in (0,1)$, since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than $γ$. Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space $γ$-bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.
☆ Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights
Large Language Models(LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose Large Language Models under limited prompting settings. In this study, we extend the research area of structured threat modelling by systematically evaluating domain-adapted language models of different sizes to their general counterparts. We use both LLMs and Small Language Models(SLMs) that were domain adapted to telecommunications and cybersecuirty. For the structured threat modelling, we selected the widely used STRIDE approach and the application area is 5G security. We present a comprehensive empirical evaluation using 52 different configurations (on 8 different language models) to analyze the impact of 1) domain adaptation, 2) model scale, 3) decoding strategies (greedy vs. stochastic sampling), and 4) prompting technique on STRIDE threat classification. Our results show that domain-adapted models do not consistently outperform their general-purpose counterparts, and decoding strategies significantly affect model behavior and output validity. They also show that while larger models generally achieve higher performance, these gains are neither consistent nor sufficient for reliable threat modelling. These findings highlight fundamental limitations of current LLMs for structured threat modelling tasks and suggest that improvements require more than additional training data or model scaling, motivating the need for incorporating more task-specific reasoning and stronger grounding in security concepts. We present insights on invalid outputs encountered and present suggestions for prompting tailored specifically for STRIDE threat modelling.
☆ PhyGround: Benchmarking Physical Reasoning in Generative World Models
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
comment: Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/
☆ Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge ICML 2026
Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal--dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy--cost trade-offs under distribution shift.
comment: Accepted at ICML 2026
☆ New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach
Campus well-being underpins academic success, yet many universities lack effective methods for monitoring satisfaction and detecting mental health risks. This dissertation addresses these gaps through prevention (improving feedback collection) and intervention (advancing mental health detection), unified under an integrated framework. For prevention, we developed TigerGPT, a personalized survey chatbot leveraging LLMs to engage users in context-aware conversations grounded in conversational design and engagement theory, achieving 75% usability and 81% satisfaction. To address its limitations in repetitiveness and response depth, we introduced AURA, a reinforcement-learning framework that adapts follow-up question types (validate, specify, reflect, probe) within a session using an LSDE quality signal (Length, Self-disclosure, Emotion, Specificity), initialized from 96 prior conversations. AURA achieved +0.12 mean quality gain (p=0.044, d=0.66), with 63% fewer specification prompts and 10x more validation behavior. For intervention, we examine Expressive Narrative Stories (ENS) for mental health screening, showing BERT(128) captures nuanced linguistic features without keyword cues, while conventional classifiers depend heavily on explicit mental health terms. We then developed PsychoGPT, an LLM built on DSM-5 and PHQ-8 guidelines that performs initial distress classification, symptom-level scoring, and reconciliation with external ratings for explainable assessment. To reduce hallucinations, we proposed Stacked Multi-Model Reasoning (SMMR), layering expert models where early layers handle localized subtasks and later layers reconcile findings, outperforming single-model solutions on DAIC-WOZ in accuracy, F1, and PHQ-8 scoring. Finally, a cohesive framework unifies these tools, enabling adaptive survey insights to flow directly into specialized mental health detection models.
comment: PhD Dissertation, University of Missouri, May 2026
☆ The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies NeurIPS 2026
Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are "computationally important" by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with "the answer is X," removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (<=0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p<10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment <5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
comment: 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT
☆ Interpretable Machine Learning for Football Performance Analysis: Evidence of Limited Transferability from Elite Leagues to University Competition
Machine learning has become increasingly prevalent in football performance analysis, yet most studies prioritize predictive accuracy while implicitly assuming that learned performance determinants and their interpretations are transferable across competition levels. Whether interpretability remains reliable under domain shift-from elite to university football remains largely unexplored. This study investigates whether performance determinants learned from elite competitions are structurally transferable to university-level football and whether their interpretations remain robust under domain shift. Models were trained on large-scale event data from the top five European leagues and applied to university football data from National Tsing Hua University (NTHU) using an identical feature space. Random Forest and Multilayer Perceptron models were interpreted using SHapley Additive exPlanations (SHAP) and Counterfactual Impact Score (CIS). Across five experiments, elite football exhibited a stable and consistent hierarchy of performance determinants across leagues, models, and explanation methods. In contrast, NTHU university football showed substantial reordering of key indicators, reduced explanation stability, weaker structural agreement with elite domains, and increased sensitivity to explanation method. These findings suggest that interpretability robustness is domain-dependent. Rather than reflecting methodological limitations alone, instability in explanations under domain shift may serve as a diagnostic signal of structural ambiguity in the target domain.
comment: 19 pages, 6 figures
☆ Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing
Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically -- through topic choice, imagery, and setting--6hy-at rates significantly different from chance, up to 79\%. When told to actively hide the secret, models write \emph{away from} it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to ``focus on instead'' partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
☆ PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering
Knowledge Graph Question Answering (KGQA) aims to answer user questions by reasoning over Knowledge Graphs (KGs). Recent KGQA methods mainly follow the retrieval-augmented generation paradigm to ground Large Language Models~(LLMs) with structured knowledge from KGs. However, training effective models to retrieve question-relevant evidence from KGs typically requires high-quality intermediate supervision signals, such as question-relevant paths or subgraphs, which are time- and resource-intensive to obtain. We propose PathISE, a novel framework for learning high-quality intermediate supervision from answer-level labels. PathISE introduces a lightweight transformer-based estimator that estimates the informativeness of relation paths to construct pseudo path-level supervision. This supervision is then distilled into an LLM path generator, whose generated paths are grounded in the KG to provide compact evidence for inductive answer reasoning. ExtensiveISE experiments on three KGQA benchmarks show that PathISE achieves competitive or state-of-the-art KGQA performance, and provides reusable supervision signals that can enhance existing KGQA models, without relying on costly LLM-refined supervision signals. Our source code is available at https://anonymous.4open.science/r/PathISE-2F87.
☆ ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
☆ TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
Urban mobility is naturally expressed both as trajectories in space and as natural-language descriptions of travel intent, constraints, and preferences. However, prior work rarely evaluates these two modalities together on the same real-world trajectories: trajectory modeling often stays geometry-centric, while language-centric mobility benchmarks frequently target route planning and tool use rather than fine-grained, verifiable alignment between text and the underlying route. We introduce TrajPrism, a multi-task benchmark for language-trajectory alignment that unifies (i) instruction-conditioned trajectory generation, (ii) language-driven semantic trajectory retrieval, and (iii) trajectory captioning, together with an evaluation protocol that measures trajectory fidelity, retrieval quality, and language groundedness. We construct TrajPrism by pairing real urban trajectories with judge-filtered language annotations generated under a four-dimensional travel-intent taxonomy. The benchmark contains 300K selected trajectories across Porto, San Francisco, and Beijing, yielding 2.1M task instances from three instruction variants, three retrieval queries, and one caption per trajectory. We further develop proof-of-concept models for each task: TrajAnchor for instruction-conditioned trajectory generation, TrajFuse for semantic trajectory retrieval, and TrajRap for trajectory captioning. These models instantiate the proposed tasks and show that geometry-only trajectory baselines leave a large gap on our protocol, especially where language is part of the input-output interface. We release TrajPrism with code and a reproducible annotation pipeline that is designed to be portable across cities, given compatible trajectory inputs and map resources.
comment: This paper is under review
☆ Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenizatio
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2{=}0.86$) between fusion capacity and reconstruction quality, identifying \textit{representation richness} as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
☆ Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition SP
Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.
comment: Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV
☆ MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation CVPR 2026
The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion.In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS.We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers.We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
comment: Accepted to CVPR 2026 Findings. 11 pages, 6 figures
☆ Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
☆ Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
comment: Preprint. 17 pages, 8 figures, 6 tables
☆ MATRA: Modeling the Attack Surface of Agentic AI Systems -- OpenClaw Case Study IEEE
LLMs are increasingly deployed as autonomous agents with access to tools, databases, and external services, yet practitioners (across different sectors) lack systematic methods to assess how known threat classes translate into concrete risks within a specific agentic deployment. We present MATRA, a pragmatic threat modeling framework for agentic AI systems that adapts established risk assessment methodology to systematically assess how known LLM threats translate into deployment-specific risks. MATRA begins with an asset-based impact assessment and utilizes attack trees to determine the likelihood of these impacts occurring within the system architecture. We demonstrate MATRA on a personal AI agent deployment using OpenClaw, quantifying how architectural controls such as network sandboxing and least-privilege access reduce risk by limiting the blast radius of successful injections.
comment: Accepted for presentation at the 5th International Workshop on Designing and Measuring Security in Systems with AI (DeMeSSAI 2026), co-located with the 11th IEEE European Symposium on Security and Privacy (EuroS&P 2026), Lisbon, Portugal, July 10, 2026
☆ GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; common for training-free selectors via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
☆ The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
LLM-based foundation agents that perceive, reason, and act across thousands of reasoning steps are rapidly becoming the dominant paradigm for deploying artificial intelligence in open-ended, long-horizon complex tasks. Despite this significance, the field remains overwhelmingly engineering-driven. Engineering practice has converged on useful primitives (tool loops, memory banks, harnesses, reflection steps), yet these are assembled by empirical trial and error rather than from first principles. Fundamental questions remain open: under what conditions does a long-running agent remain on-task? How should an agent respond when its environment exceeds its representational capacity? What architectural properties are necessary for safe self-improvement? We argue that cybernetics, the mid-twentieth-century science of control and communication in complex systems, provides the missing theoretical scaffold for foundation agents. By mapping six canonical laws of classical cybernetics onto six agent design principles, and synthesizing those principles into three engineering desiderata (reliability, lifelong running, and self-Improvement), we arrive at a framework termed Agent Cybernetics. Three application domains, code generation, computer use and automated research, exemplify the analytical framework of agent cybernetics by identifying failure modes and concrete engineering recommendations. We hope that agent cybernetics opens a new research venue and establishes the scientific foundation that foundation agents need for principled, reliable real-world deployment.
comment: Preliminary Work
☆ Provable Sparse Inversion and Token Relabel Enhanced One-shot Federated Learning with ViTs
One-Shot Federated Learning, where a central server learns a global model in a single communication round, has emerged as a promising paradigm. However, under extremely non-IID settings, existing data-free methods often generate low-quality data that suffers from severe semantic misalignment with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by fully exploiting all patches of synthetic images. Specifically, FedMITR employs sparse model inversion during data generation, selectively inverting semantic foregrounds while halting the inversion of uninformative backgrounds. To address semantically meaningless tokens that hinder ViT predictions, we implement a differentiated strategy: patches with high information density utilize generated pseudo-labels, while patches with low information density are relabeled via ensemble models for robust distillation. Theoretically, our analysis based on algorithmic stability reveals that Sparse Model Inversion eliminates gradient instability arising from background noise, while Token Relabel effectively reduces gradient variance, collectively guaranteeing a tighter generalization bound. Empirically, extensive experimental results demonstrate that FedMITR substantially outperforms existing baselines under various settings.
comment: 18 Pages
☆ Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model SP
We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
comment: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI
☆ iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning
Automated transit payment analysis is vital for scalable fare auditing and passenger analytics, yet practice still relies on limited manual inspection. Prior vision- and skeleton-based methods remain brittle under noisy onboard surveillance and often depend on poorly generalizable handcrafted features. Building on the success of graph convolutional networks in human action recognition, we observe that skeleton features excel at modeling global spatiotemporal dependencies but tend to underemphasize the subtle local relative motions that distinguish payment actions. In contrast, RGB features preserve fine-grained spatial details yet often lack reliable temporal continuity in surveillance footage. To bridge both system-level deployment needs and model-level design challenges, we present iPay, an integrated payment action recognition framework for onboard transit surveillance system. iPay adopts a multimodal mixture-of-experts architecture with four tightly coupled streams: (1) an RGB expert stream emphasizing local evidence via region-focused computation; (2) a skeleton expert stream modeling articulated motion with a graph convolutional backbone; (3) a dual-attention fusion stream enabling skeleton-to-RGB temporal transfer and RGB-to-skeleton spatial enhancement; and (4) a prior-driven Spatial Difference Discriminator (SDD) that explicitly models hand-to-anchor relative motion to improve task-specific discriminability. We also collaborate with local transit agencies to collect over 55 hours of real onboard surveillance footage, yielding 500+ payment clips. Experiments show that iPay outperforms prior methods and achieves 83.45\% recognition accuracy with competitive computational efficiency, making it suitable for edge deployment. Code is available at https://github.com/ccoopq/iPay.
☆ AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
☆ An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either due to a lack of causal awareness or acting under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate, while maintaining 62.0% repair accuracy and a 3ms mean time to repair.
☆ Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish ACL 2026
Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.
comment: Accepted at BigPicture Workshop 2026 (co-located with ACL 2026)
☆ The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions
Multi-agent systems (MAS) assume that collaborating inherently improves Large Language Model (LLM) reasoning. We challenge this by demonstrating that simulated social pressure triggers an algorithmic ``Bystander Effect,'' inducing severe cognitive loafing. By evaluating 22,500 deterministic trajectories across 3 dataset contexts (GAIA, SWE-bench, Multi-Challenge) with 3 state-of-the-art (SOTA) models, we semantically audit internal reasoning traces against external outputs. We formalize the \textit{Interaction Depth Limit} ($D_L$), the exact plurality threshold where an agent's logical sovereignty collapses into social compliance. Crucially, we uncover the \textit{Sovereignty Gap}: models frequently compute the correct derivation internally but suffer ``Alignment Hallucinations'' -- actively subjugating empirical evidence to sycophantically appease a simulated swarm. We prove that multi-agent social load is strictly non-commutative; the "brand" identity of the ``Lead Anchor'' auditor disproportionately dictates the swarm's integrity. These findings expose architectural vulnerabilities, proving that unstructured multi-agent topologies can degrade independent reasoning.
☆ GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing
Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical `God` capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we trained two "hands of God" (two BERT models). Among them, the first leverages the BERT's masked language modeling capability to guide the mutation of genes (expression symbols). The other BERT model guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.
comment: 70 pages
☆ Is Data Shapley Not Better than Random in Data Selection? Ask NASH ICML-26
Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
☆ Step Rejection Fine-Tuning: A Practical Distillation Recipe
Rejection Fine-Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE-bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they form a large portion of all trajectories for hard tasks and even then may be partially correct. In this work, we propose Step Rejection Fine-Tuning (SRFT) - a practical way to leverage these unresolved trajectories. For this, we employ a critic LLM to assess the correctness of each step in a trajectory. Consequently, during training, we mask the loss for erroneous steps while retaining them in the context window. This way we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE-bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding completely, reaching the total resolution rate of 32.2%.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
comment: 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control
☆ Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model's capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs' ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.
comment: 17pages, 5 figures
☆ bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
comment: 31 pages, 16 figures
☆ When Can Digital Personas Reliably Approximate Human Survey Findings?
Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.
☆ Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights
We consider the active learning problem where the goal is to learn an unknown function with low prediction error under an unknown Boltzmann distribution induced by the function itself. This self-induced weighting arises naturally in problems such as potential energy surface (PES) modeling in computational chemistry, yet poses unique challenges as the target distribution is unknown and its partition function is intractable. We propose \texttt{AB-SID-iVAR}, a Gaussian Process-based acquisition function that approximates the intractable Bayesian target distribution in closed form while avoiding partition function estimation, and is applicable to both discrete and continuous input domains. We also analyze a Thompson sampling alternative (\texttt{TS-SID-iVAR}) as a higher variance Monte Carlo variant. Despite the unknown target, under mild conditions, we establish that the terminal prediction error vanishes with high probability, and provide a tighter average-case guarantee. We demonstrate consistent improvements over existing approaches in this setting on synthetic benchmarks and real-world PES modeling and drug discovery tasks.
☆ A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables
Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.
☆ diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories
Trajectories are nowadays valuable information for a wide range of applications. However they are also inherently sensitive, as they contain highly personal information about individuals. Facing this challenge, synthesizing mobility trajectories has emerged as a promising solution to leverage mobility information while preserving privacy. State-of-the-art models, often rely on the false assumptions of generative models implicit privacy and fails to provide privacy guarantees while preserving trajectories utility. Here, we introduce diffGHOST, a conditional diffusion model based on latent space segmentation, designed to answer this challenge. Thus, this paper propose a methodology that identify and mitigate memorization of critical samples using condition segments of a learn latent space.
☆ LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation in the generalization performance of the Student. We apply the proposed framework on models build upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
comment: Under review
☆ Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm ICML 2026
Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed as continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called \textbf{S}electing \textbf{T}okens via attenti\textbf{O}n \textbf{C}ontribution~(STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.
comment: Accepted by ICML 2026
☆ Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.
comment: 18 pages, 4 figures
☆ Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies
LLM-based automatic heuristic design has shown promise for generating executable heuristics for combinatorial optimization, but existing methods mainly rely on delayed endpoint performance. We propose a \emph{teacher-aware evolutionary framework} that uses independently trained learned optimization policies as behavioral teachers. Instead of deploying or imitating the teacher, our method queries it on states visited by candidate heuristic programs and uses its action preferences as local feedback for evolution. The resulting search discovers static executable heuristics guided by both task performance and teacher-derived behavioral signals. Experiments on scheduling, routing, and graph optimization benchmarks show that our method improves over performance-driven LLM heuristic evolution baselines while requiring no neural inference at deployment. These results suggest that learned optimization policies can be repurposed as behavioral feedback sources for automatic heuristic discovery.
comment: 15 pages
☆ Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above $40$%, while amplifying them suppresses the failure mode to less than $3$%. Leveraging the structural stability of the personality space, we show that vectors extracted $\textit{a priori}$ from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
comment: 20 pages, 9 figures including appendix
☆ Interpretable Coreference Resolution Evaluation Using Explicit Semantics ACL 2026
Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.
comment: Accepted at main conference for ACL 2026. 19 pages
☆ Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control
Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush--Kuhn--Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53\% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2--3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32--37\% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.
☆ PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
☆ Re-Triggering Safeguards within LLMs for Jailbreak Detection
This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM's internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.
☆ Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
comment: To appear in the Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)
☆ Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems
Designing fair algorithmic decision systems requires balancing model performance with fairness toward affected individuals: More fairness might require sacrificing some performance and vice versa, yet the space of possible trade-offs is still poorly understood. We investigate fairness in binary prediction-based decision problems by conceptualizing decision making as a multi-objective optimization problem that simultaneously considers decision-maker utility and group fairness. We investigate the set of Pareto-optimal decision rules for arbitrary utility functions for decision maker, arbitrary population distributions, and a wide range of group fairness metrics. We find that the Pareto frontier consists of deterministic, group-specific threshold rules applied to individuals' success probability. This complements existing optimality theorems from literature which, for specific fairness constraints, posit lower-bound threshold rules only. However we also show that, depending on the used fairness metric, the Pareto frontier may include upper-bound threshold rules, thus preferring individuals with lower success probabilities. We show that the location of the Pareto frontier depends only on population characteristics, utility functions and fairness score, but not on the technical design of the algorithm - our findings hold for pre-, in-, and post-processing approaches alike. Our results generalize existing optimality theorems for fairness-constrained classification and extend them to generalized fairness metrics and fairness principles, and to partial fairness regimes. This paper connects formal fairness research with legal and ethical requirements to search for less discriminatory alternatives, offering a principled foundation for evaluating and comparing algorithmic decision systems.
comment: 23 pages, The 2026 ACM conference on Fairness, Accountability, and Transparency (FAccT'26)
☆ The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime
AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.
☆ Budget-Efficient Automatic Algorithm Design via Code Graph
Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.
☆ CrackMeBench: Binary Reverse Engineering for Agents
Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an equal shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
☆ LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation IJCAI
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.
comment: Accepted at IJCAI-ECAI 2026 Demonstrations Track. Demo video: https://youtu.be/3QaKouwr4gU
☆ A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge IJCAI
Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator (https://riwwer.demo.calgo-lab.de) that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).
comment: 3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: https://riwwer.demo.calgo-lab.de
☆ An agentic framework for gravitational-wave counterpart association in the multi-messenger era
With the detection of gravitational waves (GWs), multi-messenger astronomy has opened a new window for advancing our understanding of astrophysics, dense matter, gravitation, and cosmology. The GW sources detected to date are from mergers of compact object binaries, which possess the potential to generate detectable electromagnetic (EM) counterparts. Searching for associations between GW signals and their EM counterparts is an essential step toward enabling subsequent multi-messenger studies. In the era of next-generation GW and EM detectors, the rapid increase in the number of events brings not only unprecedented scientific opportunities, but also substantial challenges to the existing data analysis paradigm. To help address these challenges, we develop GW-Eyes, an agentic framework powered by large language models (LLMs). For the first time, GW-Eyes integrates domain-specific tools and autonomously performs counterpart association tasks between GW and candidate EM events. It supports natural language interaction to assist human experts with auxiliary tasks such as catalog management, skymap visualization, and rapid verification. Our framework leverages the complex decision-making capabilities of LLMs and their traceable reasoning processes, offering a new perspective to the multi-messenger astronomy.
☆ Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt processing scheme-first disrupting the input prompt, then rectifying it-into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in striking a balance between harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis for generic smoothing framework, offering a tight bound for the defense success probability and the requirements on the disruption strength. Our approach can defend against both token-level and prompt-level jailbreaking attacks, under both established and adaptive attacking scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
☆ SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose \textbf{SenseBench}, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual \textit{perception} and subjective diagnostic \textit{description}. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also \textit{fluency illusion} and a \textit{perception-description inversion} effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available \href{https://github.com/Zhong-Chenchen/SenseBench}{\textcolor{blue}{here}}.
☆ Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.
☆ LLM Jaggedness Unlocks Scientific Creativity
As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.
☆ Deep Arguing
Deep learning has become the dominant approach for creating high capacity, scalable models across diverse data modalities. However, because these models rely on a large number of learned parameters, tightly couple feature extraction with task objectives, and often lack explicit reasoning mechanisms, it is difficult for humans to understand how they arrive at their predictions. Understanding what representations emerge and why they arise from the training data remains an open challenge. We introduce Deep Arguing, a novel neurosymbolic approach that integrates deep learning with argumentation construction and reasoning for interpretable classification with different data modalities. In our approach deep neural networks construct an argumentation structure wherein data points support their assigned label and attack different ones. Using differentiable argumentation semantics for reasoning, the model is trained end-to-end to jointly learn feature representation and argumentative interactions. This results in argumentation structures providing faithful case-based explanations for predictions. Structure constraints over the argumentation graph guide learning, improving both interpretability and predictive performance. Experiments with tabular and imaging datasets show that Deep Arguing achieves performance competitive with standard baselines whilst offering interpretable argumentative reasoning.
☆ ThreatCore: A Benchmark for Explicit and Implicit Threat Detection
Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a public available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.
☆ Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
☆ Bridging Sequence and Graph Structure for Epigenetic Age Prediction
Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence--graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site's methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8\% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at https://github.com/yaoli2022/graphage-seq.
☆ HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as \textit{feature density conflict}. We introduce the \textbf{Hybrid Hierarchical SAE (HH-SAE)} to resolve this by factorizing manifolds into a nested hierarchy of \textbf{Contextual} ($L_0$), \textbf{Atomic} ($f_1$), and \textbf{Compository} ($f_2$) tiers. Evaluating across disparate manifolds, HH-SAE demonstrates superior resolution by \textbf{``fracturing'' administrative clinical labels into physiological modes} and achieving a peak \textbf{cross-domain zero-shot AUC of 0.9156 in fraud detection}. Path ablation confirms the architecture's structural necessity, revealing a 13.46\% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9\% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.
☆ A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives
This work investigates whether knowledge-driven large language model (LLM)-based storytelling can support purposeful narrative interaction with a digital companion for older adults. To address known limitations of LLMs, including hallucinations and limited transparency, we present a reflective storytelling agent integrating knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. The study consisted of two phases. Phase I employed participatory design involving 11 domain experts in a formative evaluation that informed iterative refinement. The resulting system generates narratives grounded in structured user models representing health-promoting activities and motivations. Phase II involved 55 older adults evaluating persona-based narratives across four prompts and two creativity levels. Participants assessed perceived purpose, usefulness, cultural relatability, and inconsistencies. The system additionally computed hallucination-risk indicators to evaluate generated narratives. Participants recognised personally relevant purposes in roughly two thirds of narratives, while argument-based purposes were identified in around half of these cases. Cultural recognisability strongly influenced willingness to use the functionality, whereas minor inconsistencies were often tolerated when narratives remained understandable and personally relevant. Narratives with higher hallucination-risk indicators were more often perceived as inconsistent, while higher argument-quality indicators tended to co-occur with higher clarity and meaningfulness ratings. Overall, the study positions argument mining as a reflective inspection mechanism for comparing formal grounding signals with human evaluations in health-oriented LLM storytelling for older adults.
comment: Submitted to ACM Transactions on Intelligent Systems and Technology (TIST)
☆ PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. Yet existing continual graph learning has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset:huggingface.co/datasets/yradwan147/PrimeKGCL|Code:github.com/yradwan147/primekg-cl-neurips2026
☆ DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem \textbf{intra-group hidden failure}. To solve this, we propose \textbf{DuetFair} mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce \textbf{FairDRO}, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points ($\uparrow 6.0\%$) under the tumor-stage grouping and by 4.1 points ($\uparrow 7.4\%$) under the institution grouping over the strongest baseline.
comment: 16 pages, 2 figures
☆ Infinite Mask Diffusion for Few-Step Distillation
Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.
☆ Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
comment: 33 pages, 5 figures, 2 tables
☆ SoK: A Systematic Bidirectional Literature Review of AI & DLT Convergence
The integration of Artificial Intelligence (AI) with Distributed Ledger Technology (DLT) has become a growing research area, yet contributions tend to cluster around specific application domains or examine only one direction of the integration, leaving the broader architectural interplay between the two technologies poorly understood. This work addresses that gap through a structured, bidirectional review of peer-reviewed studies published between 2020 and 2025. We classify contributions along two directions: AI-enhanced DLT, and DLT-enhanced AI. In the first case, we examine how AI techniques improve DLT systems across five layers: data, network, consensus, execution, and application layers. In the second case, we analyse how DLT supports AI systems across five layers: infrastructure, data, model, inference, and application layers, with particular attention to federated learning, model evaluation, and multi-agent coordination. The analysis reveals that most works concentrate on a small subset of layers: execution and consensus for AI-enhanced DLT, data and model for DLT-enhanced AI. Other layers remain comparatively neglected. Despite reported improvements in controlled settings, no study demonstrates deployment at production scale, and the field has not yet offered satisfying answers to fundamental questions around scalability, interoperability, and verifiable execution. We argue that progress will require cross-layer co-design and empirical validation in real-world settings.
comment: 18 pages, 1 figure, 5 tables
☆ CMKL: Modality-Aware Continual Learning for Evolving Biomedical Knowledge Graphs
Biomedical knowledge graphs are increasingly large, dynamic, and multimodal, driven by rapid advances in biotechnology such as high-throughput sequencing. Machine learning models can infer previously unobserved biomedical relationships and characterize biomedical entities in these graphs, but existing knowledge graph embedding methods and their continual learning extensions either assume static graph structure or fail to exploit multimodal information under evolving data distributions. They also apply uniform regularization across all model parameters, ignoring that different modalities may exhibit distinct forgetting dynamics as the graph evolves. We propose the Continual Multimodal Knowledge Graph Learner (CMKL), a CL framework for biomedical KGs that natively encodes structure, text, and molecules, fuses them through a Mixture-of-Experts (MoE) router, and protects previously learned knowledge with standard EWC regularization and a K-means-diverse multimodal replay buffer. We evaluate CMKL on a 129K-entity biomedical continual benchmark with 10 tasks. On continual biomedical entity classification, CMKL reaches AP 0.591 versus 0.370 for the strongest structural baseline, a 60% gain that is driven by access to multimodal features and preserved across the sequence with near-zero forgetting (AF 0.008). On continual relationship prediction, CMKL reaches AP $0.062$, matching Naive Sequential and EWC (0.058) within seed noise and outperforming Joint Training (0.047, p=0.045) and LKGE (0.039). A frozen-text ablation reaches AP 0.136, more than double any jointly trained model, yet that signal is unreachable by margin-ranking gradients: the greedy-modality asymmetry lives at the representation level, not the fusion level, and MoE routing manages it by suppressing the unreachable modality without forcing it through a learned bottleneck. Code: github.com/yradwan147/cmkl-neurips2026
☆ SLASH the Sink: Sharpening Structural Attention Inside LLMs
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (Slash), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate Slash delivers significant and consistent performance gains across diverse LLMs.
☆ SkillEvolver: Skill Learning as a Meta-Skill
Agent skills today are static artifact: authored once -- by human curation or one-shot generation from parametric knowledge -- and then consumed unchanged, with no mechanism to improve from real use. We propose \textbf{SkillEvolver}, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill's prose and code, not model weights, so that the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI-agent. Unlike trace-distillation, the meta-skill refines only after deploying the learnt skill, such that the learning signal comes from failures another agent encounters while using it -- not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On $83$ SkillsBench tasks spanning $15^{+}$ domains, SkillEvolver reaches $56.8\%$ accuracy versus $43.6\%$ for curated human skills and $29.9\%$ for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from $1.16$ to $1.51$ on average.
☆ Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data
Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.
☆ Multi-layer attentive probing improves transfer of audio representations for bioacoustics
Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.
☆ DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by \emph{incompleteness}, \emph{incorrectness}, and \emph{redundancy}, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present \textbf{DeepRefine}, a general LLM-based reasoning model for \emph{agent-compiled knowledge refinement} that improves the quality of any pre-constructed knowledge bases with user queries to make it more suitable for the downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.
☆ ASIA: an Autonomous System Identification Agent
Over the years, research in system identification has provided a rich set of methods for learning dynamical models, together with well-established theoretical guarantees. In practice, however, the choice of model class, training algorithm, and hyperparameter tuning is still largely left to empirical trial-and-error, requiring substantial expert time and domain experience. Motivated by recent advances in agentic artificial intelligence, we present ASIA, a framework that delegates this iterative search to a large language model acting as an autonomous coding agent. Building on existing agentic platforms, ASIA closes the loop between hypothesis, implementation, and evaluation without human intervention, requiring only a plain-English description of the identification problem. We conduct an empirical study of ASIA on two system identification benchmarks and analyse the agent's search behaviour, the architectures and training strategies it discovers, and the quality of the resulting models. We also discuss the potential of the approach and its current limitations, including implicit test leakage, reduced methodological transparency, and reproducibility concerns.
☆ Formally Verifying Analog Neural Networks Under Process Variations Using Polynomial Zonotopes
Analog neural networks are gaining attention due to their efficiency in terms of power consumption and processing speed. However, since analog neural networks are implemented as physical circuits, they are highly sensitive to manufacturing process variations, which can cause large deviations from the nominal model. We present a polynomial-based model that resembles the performance of the neuron circuit under process variations. Then, we formally verify the behavior of the circuit-level model using reachability analysis with polynomial zonotopes, thus, avoiding conventional, time-consuming Monte Carlo simulations. We evaluate our proposed verification approach on three different datasets, verifying both fully-connected and convolutional analog neural networks. Our experimental results confirm the effectiveness of our verification approach by reducing the verification time from days to seconds while enclosing 99% of the variation samples.
☆ Cavity-Enhanced Collective Quantum Processing with Polarization-Encoded Qubits
We introduce a cavity-enhanced optical architecture for collective quantum processing in which logical qubits are encoded in the polarization subspace of recirculating intracavity modes. The physical carrier and computational degree of freedom are explicitly separated: harmonic cavity bundles provide a stable resonant substrate, while programmable polarization transformations implement single-qubit operations. A polarization-selective nonlinear interaction in the entanglement region generates tunable controlled-phase gates, enabling a universal gate set. A parameter-scaling analysis shows that order-unity conditional phases are attainable in centimeter-scale cavities using experimentally accessible solid-state nonlinear media, without requiring extreme nonlinear coefficients, millisecond photon lifetimes, or sub-hertz laser stabilization. The results indicate that resonant recirculation provides a physically plausible platform for cavity based collective quantum architectures.
☆ Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
☆ Statistical Model Checking of the Keynes+Schumpeter Model: A Transient Sensitivity Analysis of a Macroeconomic ABM
Agent-based models (ABMs) are increasingly used in macroeconomics, but their analysis still often relies on ad hoc Monte Carlo campaigns with heterogeneous statistical effort across parameter settings. We show how statistical model checking (SMC), implemented through MultiVeStA, can provide a principled analysis layer for a realistic macroeconomic ABM without rewriting the simulator in a dedicated formalism. Our case study is the heuristic-switching Keynes+Schumpeter(K+S) model, analysed hrough a transient sensitivity campaign over one-parameter sweeps, two macro observables (unemployment and GDP growth), and one auxiliary micro-level probe (market share) on the post-warmup phase of a 600-step horizon. The analysis is driven by reusable temporal queries, observable-specific precision targets, and confidence-based stopping rules that automatically determine the simulation effort required by each configuration. Results show a clear contrast across parameter families: macro-financial and structural sweeps produce the strongest transient effects, whereas several heuristic-rule sweeps remain much weaker under the same precision policy. More broadly, the paper shows that SMC can support reproducible and informative quantitative analysis of substantively rich economic ABMs, while making uncertainty estimates and simulation cost explicit parts of the reported results.
☆ StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1{,}500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. \textbf{(i)} Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. \textbf{(ii)} Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. \textbf{(iii)} Human and LLM harmfulness judgments are broadly aligned (Spearman $ρ=0.62$), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.
comment: Preprint
☆ Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but it would benefit from incorporating observable metrics and real-data validation.
☆ Physical probes expose and alleviate chemical-environment collapse in molecular representations
Nuclear magnetic resonance (NMR) spectroscopy provides an experimental readout of local chemical environments, but its use in molecular representation learning has been constrained by heterogeneous data and incomplete atom-level assignments. Here we construct complementary high-fidelity experimental and computational 13C NMR resources, which reveal a recurrent form of representational collapse: atoms that are equivalent in molecular topology can remain experimentally distinct in their real chemical environments, whereas explicit 3D descriptions are further limited by static conformations in dynamic regimes. To alleviate this bottleneck, we develop CLAIM (Contrastive Learning for Atom-to-molecule Inference of Molecular NMR), a framework that aligns efficient topological molecular inputs with atom-resolved NMR observables. Through hierarchical chemical priors and cross-level contrastive learning, CLAIM restores lost chemical resolution and markedly improves atom-level molecule-spectrum retrieval. CLAIM remains robust in flexible and tautomeric systems for 13C NMR prediction, improves stereoisomer discrimination without explicit 3D modelling, and transfers to broader molecular property tasks including ADMET prediction and fluorescence estimation. These results establish physically grounded spectral alignment as an effective strategy for alleviating chemical-environment collapse and for guiding experimentally grounded molecular representation learning.
☆ CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/potatochip1211/CoWorld-VLA.
☆ Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
AI systems can now cheaply generate plausible scientific artifacts such as papers, reviews, and surveys. This creates a risk of \emph{epistemic pollution} in our scientific systems, where unreliable but plausible-looking artifacts can accumulate faster than the system can filter them out. The problem is structural: the epistemic infrastructure of science was calibrated to a world where producing a plausible artifact required substantial expertise, labor, and time, so generation cost itself served as a rough filter; AI weakens that filter without comparably lowering verification cost. We argue that \textbf{AI-era science should treat this as an engineering problem: redesigning epistemic infrastructure to rebalance the costs of generation and verification}. The current paper-centered system makes verification expensive: papers compress long-context scientific logic into prose, forcing reviewers, human or AI, to reconstruct underlying argument structure before they can evaluate it. As one step in this direction, we propose \textbf{blueprints} as preliminary epistemic infrastructure: structured, decomposed research artifacts that represent claims, evidence, assumptions, and definitions as typed graph components. Blueprints are designed to trade an upfront generation cost for cheaper, more local, more distributed verification downstream. We have instantiated the proposal in a proof-of-concept prototype.
☆ Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets
This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of large LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.
comment: Accepted for publication in CARMA 2026 proceedings
☆ Every finite group admits a just finite presentation
A finite presentation < X | R > of a finite group is called `just finite' if removing any relation from R results in a presentation for an infinite group. It has been an open question (Kourovka Notebook, Problem 21.10) whether every finite group admits such a presentation. We resolve this conjecture in the affirmative.
comment: 5 pages. Significant assistance was provided by the AI co-mathematician tool developed by Google DeepMind
☆ LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs ICML2026
Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver's end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Codes are available at https://github.com/hzn18/LLM4Branch.
comment: ICML2026 preprint, camera ready in progress
☆ AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improve anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
comment: We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw
☆ Phoenix-VL 1.5 Medium Technical Report
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion tokens multimodal corpus, followed by a 250-billion tokens long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22-billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles, training methodology, and highlight benchmark and inference performance.
comment: Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1
☆ GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic
Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse safety threats, particularly in accident-prone scenarios. Recent safeguard mechanisms have shown promise by incorporating logical constraints, yet most rely on static formulations that lack temporally grounded safety reasoning over evolving traffic interactions, resulting in limited robustness in dynamic driving environments. To address these limitations, we propose GuardAD, a model-agnostic safeguard that formulates AD safety as an evolving Markovian logical state. GuardAD introduces Neuro-Symbolic Logic Formalization, which represents safety predicates over heterogeneous traffic participants and continuously induces them via n-th order Markovian Logic Induction. This design enables the inference of emerging and latent hazards beyond single-step observations. Rather than simply vetoing unsafe actions, GuardAD performs Logic-Driven Action Revision, where inferred safety states actively guide action refinement without modifying the underlying MLLM. Extensive experiments on multiple benchmarks and AD-MLLMs demonstrate that GuardAD substantially reduces accident rates (-32.07%) while slightly improving task performance (+6.85%). Moreover, closed-loop simulation evaluations, together with physical-world vehicle studies, further validate the effectiveness and potential of GuardAD.
☆ Agentic Performance at the Edge: Insights from Benchmarking
Agentic artificial intelligence (AI) is a natural fit for Internet of Things (IoT) and edge systems, but edge deployments are often constrained to models around 8 billion parameters or smaller. An important question is: How much agentic-task quality is lost when model size is constrained by memory, power, and latency budgets? To address this question, in this paper, we provide an initial empirical study considering edge-focused model scaling, general-purpose versus coder-oriented model effects, and tool-enabled execution under a fixed protocol. We introduce a domain-conditioned evaluation methodology, an implementation-grounded analysis of model-tool interactions, practical guidance for model selection under constraints, and an analysis of failure modes that reveals distinct semantic versus execution failure patterns across model families. Our core finding is that edge-agent quality is not a simple function of parameter count. Robust deployment depends on the joint design of model choice and tool workflow. Domain-conditioned analysis reveals Pareto fronts in the accuracy-latency space that can guide strategy selection based on operational priorities.
comment: Accepted to AutoEdge workshop, co-located with MobiSys 2026
☆ Agent-X: Full Pipeline Acceleration of On-device AI Agents
LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.
comment: Accepted for publication at MobiSys-2026
☆ Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge
Scientific knowledge on the Web is published as passive assertions and cannot decide when to validate evidence, reconcile contradictions, or update confidence as findings accumulate. Curation depends on centralised middleware and institutional continuity, but when registries close, active stewardship stops even when data remain online. We advance the concept of Autonomous FAIR Digital Objects (aFDOs) from an abstract idea to an operational model, to offer a route from passive scientific publication toward accountable, standards-aligned automation that can outlive its publishing institutions. aFDO augments FDOs with three capabilities anchored in Semantic Web standards, namely 1) a policy layer over RDF-star aligned with PROV-O, SHACL, and ODRL for portable condition-action rules, 2) an announcement layer over ActivityStreams 2.0 that bounds per-announcement evaluation cost, and 3) an agreement layer that resolves multi-source contradictions through reputation and confidence weighted agreement under a bounded adversarial model. We provide a formal definition that distinguishes policy specifications, event handlers, and communication interfaces. We evaluate an open reference implementation on 4,305 FDOs grounded in rare-disease ontologies, namely ClinVar, HPO, and Orphanet, combined with controlled synthetic observations. The consensus mechanism resolves 56.3% of 3,914 naturally occurring ClinVar conflicts where multiple submitters disagree and an expert panel has subsequently adjudicated. Under Sybil, collusion, and poisoning attacks, the mechanism degrades gracefully within its design Byzantine-tolerance bound (f < n/5), and fails as predicted beyond that bound.
☆ EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
Graph reasoning agents operating from natural-language inputs must solve a coupled problem: they must reconstruct a structured graph instance from text, decide whether existing computational assets are sufficient, interact with tools under a strict execution protocol, and satisfy an external verifier that checks structured correctness rather than textual plausibility. Existing approaches usually improve either the instruction side or the tool side in isolation, which leaves unclear what should be updated after failure. We propose EGL-SCA, a verifier-centric dual-space framework that models a graph reasoning agent using two collaborative components: an instruction-side policy space for reasoning strategies, and a tool-side program space for executable algorithmic tools. Our central mechanism is structural credit assignment, which maps trajectory evidence to conditional updates, precisely routing failures to either prompt optimization or tool synthesis and repair. To provide sufficient learning signals for dual-space adaptation, we introduce a training distribution stratified by task family, coupled with a Pareto-style retention strategy to balance success, generality, and parsimony. Experiments on four graph reasoning benchmarks show that EGL-SCA achieves a state-of-the-art 92.0\% average success rate. By effectively co-evolving instructions and tools, our framework significantly outperforms both pure-prompting and fixed-toolbox baselines.
☆ Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.
☆ RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce \textbf{RW-Post}, a post-aligned \textbf{text--image benchmark} for real-world multimodal fact-checking with \emph{auditable} annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide \textbf{AgentFact} as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at https://github.com/xudanni0927/AgentFact.
☆ Portable Active Learning for Object Detection CVPR 2026
Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
comment: CVPR 2026(highlight)
☆ How Mobile World Model Guides GUI Agents?
Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
☆ TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at https://github.com/george-QF/TMAS-code.
☆ EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
comment: 33 pages, 9 figures
☆ PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.
comment: 47 pages, 17 figures, 17 tables
☆ CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings
Intracranial electrocorticography (ECoG) offers high-signal-to-noise access to cortical activity for brain-computer interfaces, yet limited per-patient data has led most prior work to rely on small, subject-specific decoders that neglect information shared across patients. We investigate whether large pretrained scalp-EEG foundation models (EEG FMs) can be adapted to ECoG, enabling cross-patient learning and competitive decoding performance while calibrating to a held-out patient in 10-30 minutes on a single GPU. We introduce CORTEG, a cross-modality transfer framework that combines a pretrained EEG FM backbone, an electrode-aware KNNSoftFourier spatial adapter, a dual-stream tokenizer for low-frequency and high-gamma activity, and a leave-one-subject-out fine-tuning strategy. We evaluate CORTEG on two challenging regression tasks: public finger trajectory regression (n=9) and private audio envelope regression (n=16). CORTEG matches or exceeds the strongest task-specific baselines on both tasks: it reaches the highest mean correlation among compared methods on the public finger benchmark (gain not statistically significant on n=9 subjects), with larger and statistically significant gains on the audio task and in low-data per-patient calibration. Feature analyses align with neurophysiology, and latent manifolds capture low-dimensional finger-movement structure. CORTEG provides systematic evidence that scalp-EEG pretraining can be repurposed for ECoG decoding, enabling data-efficient intracranial BCIs that can adapt to new patients.
☆ PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive \texttt{int8} quantization, PowerStep remains numerically stable and reduces optimizer memory by $\sim\!8\times$ compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
☆ EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents
Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.
☆ SCALAR: A Neurosymbolic Framework for Automated Conjecture and Reasoning in Quantum Circuit Analysis
In this paper, we present SCALAR (Symbolic Conjecture and LLM-Assisted Reasoning), a neurosymbolic framework for automated conjecture generation in quantum circuit analysis built on top of the CUDA-Q open source framework. The system integrates quantum simulation, symbolic conjecture generation, and LLM-based interpretation. We evaluate SCALAR on 82 MaxCut instances from the MQLib benchmark dataset and extend the analysis to 2,000 randomly generated graphs across four topologies: regular, Erdos-Renyi, Barabasi-Albert, and Watts-Strogatz. The framework generates conjectured bounds relating optimal QAOA parameters to graph invariants, including known relationships such as periodicity constraints on the phase separation parameter $γ$. SCALAR also recovers previously reported parameter transfer phenomena across structurally similar instances. Additionally, the system identifies correlations between graph structural features and optimization landscape properties, which we characterize through invariant-based descriptors. Using CUDA-Q tensor network simulator, we scale experiments to instances of up to 77 qubits. We discuss the accuracy, generality, and limitations of the generated conjectures, including sensitivity to graph class and quantum circuit depth.
☆ Verifiable Process Rewards for Agentic Reasoning
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
☆ Relations Are Channels: Knowledge Graph Embedding via Kraus Decompositions
Knowledge graph embedding (KGE) models typically represent each relation as an operator on entity embeddings. In this work, we identify three structural axioms that any principled relation operator must satisfy, linearity, trace preservation, and complete positivity, and show that they characterize a Kraus channel structure via the Kraus representation theorem. The completeness constraint defining this family is equivalent to these axioms, providing a principled foundation rather than an externally imposed condition. Under this formulation, most existing operator-based KGE models are recoverable as special cases with Kraus rank $κ= 1$ under specific embedding choices. We further generalize this characterization to arbitrary metric geometries by introducing \mbox{w-Kraus} channels, which satisfy completeness by construction within their respective spaces. Building on this theory, we propose \textsc{KrausKGE}, a principled KGE model that naturally handles $1$-to-$N$ and $N$-to-$N$ relations, supports $k$-hop reasoning without requiring explicit path encoders, and eliminates the need for norm constraints on entity embeddings. Additionally, our framework yields the first theoretically grounded per-relation complexity measure in the KGE literature, with a provable lower bound in terms of the empirical relation matrix rank. Empirical evaluation demonstrates that \textsc{KrausKGE} consistently outperforms strong baselines on $N$-to-$N$ relations, with performance gains that increase monotonically with relation fan-out, in alignment with theoretical predictions.
☆ Active Tabular Augmentation via Policy-Guided Diffusion Inpainting ICML 2026
Generative tabular augmentation is appealing in data-scarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize a fidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner's held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.
comment: Accepted for publication at ICML 2026
☆ Positive Alignment: Artificial Intelligence for Human Flourishing
Existing alignment research is dominated by concerns about safety and preventing harm: safeguards, controllability, and compliance. This paradigm of alignment parallels early psychology's focus on mental illness: necessary but incomplete. What we call Positive Alignment is the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored way while (ii) remaining safe and cooperative. It is a distinct and necessary agenda within AI alignment research. We argue that several existing failures of alignment (e.g., engagement hacking, loss of human autonomy, failures in truth-seeking, low epistemic humility, error correction, lack of diverse viewpoints, and being primarily reactive rather than proactive) may be better addressed through positive alignment, including cultivating virtues and maximizing human flourishing. We highlight a range of challenges, open questions, and technical directions (e.g., data filtering and upsampling, pre- and post-training, evaluations, collaborative value collection) for different phases of the LLM and agents lifecycle. We end with design principles for promoting disagreement and decentralization through contextual grounding, community customization, continual adaptation, and polycentric governance; that is, many legitimate centers of oversight rather than one institutional or moral chokepoint.
☆ Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
comment: Accepted to The Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
☆ Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
☆ LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
Time series forecasting serves as an essential tool for many real-world applications, supporting tasks such as resource optimization and decision-making. Despite significant architectural advancements, most modern models still treat forecasting task as a fixed mapping from history to target horizons. This induces temporal decoupling across future time points and limits the model's ability to adapt to the evolving context as forecasting progresses. In this work, we present LeapTS, a novel framework that reformulates time series forecasting as a dynamic scheduling process over the prediction horizon. Specifically, LeapTS organizes the forecasting process into multi-level decisions using: (1) the hierarchical controller to dynamically select the optimal prediction scale and advancement length at each step, and (2) continuous-time state evolution driven by neural controlled differential equations. Within this process, the controlled update mechanism explicitly couples the irregular temporal dynamics with discrete scheduling feedback. Extensive evaluations on both real-world and synthetic datasets demonstrate that LeapTS improves overall forecasting performance by at least 7.4% while achieving a 2.6$\times$ to 5.3$\times$ inference speedup over representative Transformer-based models. Furthermore, by explicitly tracing the scheduling trajectories, we reveal how the model autonomously adapts its forecasting behavior to capture non-stationary dynamics.
☆ Generative AI Fuels Solo Entrepreneurship, but Teams Still Lead at the Top
Recent advances in generative artificial intelligence (AI) are reshaping who enters entrepreneurship, but not who reaches the top of the quality distribution. Using data on over 160,000 product launches on Product Hunt, we find that entrepreneurial entry increased sharply following the public release of ChatGPT-3.5, driven disproportionately by solo entrepreneurs. This shift toward solo entry is particularly pronounced in categories that historically favored team-based ventures. However, much of this growth reflects low-commitment, experimental entry and does not translate into greater representation among the highest-quality outcomes. Team-based ventures are increasingly dominant in the top tiers of platform rankings. These findings suggest that generative AI lowers barriers to solo entrepreneurship while reinforcing team-based advantages.
☆ AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.
comment: Accepted at the AHLI Conference on Health, Inference, and Learning 2026
☆ Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
☆ DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models ICASSP 2026
Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially private stochastic gradient descent (DP-SGD) mitigates this risk by clipping each client's contribution to a threshold $C$ and adding noise proportional to $C$. Existing adaptive clipping techniques dynamically adjust $C$ but demand tedious hyperparameter tuning, which can erode the privacy budget. In this paper, we introduce DP-LAC, a method that first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of $6.6\%$.
comment: Accepted at ICASSP 2026
☆ MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.
☆ IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering.Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $κ_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0--3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions.Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
☆ E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability
TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network's size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
☆ Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem
Managing disruptions in railway traffic management is a major challenge. Rising traffic density and infrastructure limits increase complexity, making the Vehicle Routing and Scheduling Problem (VRSP) difficult to solve reliably and in real time. While Operational Research (OR) methods are widely used, most dispatching still relies on human expertise due to the problem's exponential combinatorial complexity. Reinforcement Learning (RL) has gained attention for its potential in multi-agent coordination, but existing RL approaches often underperform OR methods and struggle to scale in dense rail networks. This paper addresses this gap from a machine learning perspective by introducing a semi-hierarchical RL formulation tailored to operational railway constraints. The method separates dispatching from routing through dedicated action and observation spaces, enabling policies to specialise in distinct decision scopes and addressing the imbalance between rare dispatch decisions and frequent routing updates. The approach is evaluated on the Flatland-RL simulator across five difficulty levels and 50 random seeds, with 7 to 80 trains. Results show substantially improved coordination, resource utilisation, and robustness compared with heuristic baselines and monolithic RL, nearly doubling the number of trains reaching their destinations, while keeping deadlock rates below 5% and adaptively sequencing, delaying, or cancelling trains under heavy congestion.
☆ A Cold Diffusion Approach for Percussive Dereverberation IEEE
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal structure. In this work, we propose a cold diffusion framework for dereverberating stereo drum stems (downmixes), modeling reverberation as a deterministic degradation process that progressively transforms anechoic signals into reverberant ones. We investigate two reverse-process parameterizations, Direct (next-state) and a Delta-normalized residual (velocity-style) prediction, and implement the framework using both a UNet and a diffusion Transformer backbone. The models are trained and evaluated on curated datasets comprising both acoustic and electronic drum recordings, with reverberation generated using a combination of synthetic and real room impulse responses. Extensive experiments on in-domain and fully out-of-domain test sets demonstrate that the proposed method consistently outperforms strong score-based and conditional diffusion baselines, evaluated using signal-based and perceptual metrics tailored to percussive audio.
comment: Accepted for the 2026 IEEE World Congress on Computational Intelligence, IJCNN Track, 21-26 June 2026, Maastricht, the Netherlands
☆ Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation ACL 2026
Retrieval-augmented generation (RAG) is a widely adopted paradigm for enhancing LLMs in medical applications by incorporating expert multimodal knowledge during generation. However, the underlying retrieval databases may naturally contain, or be intentionally injected with, adversarial knowledge, which can perturb model outputs and undermine system reliability. To investigate this risk, prior studies have explored knowledge poisoning attacks in medical RAG systems. Nevertheless, most of them rely on the strong assumption that adversaries possess prior knowledge of user queries, which is unrealistic in deployments and substantially limits their practical applicability. In this paper, we propose M\textsuperscript{3}Att, a knowledge-poisoning framework designed for medical multimodal RAG systems, assuming only limited distribution knowledge of the underlying database. Our core idea is to inject covert misinformation into textual data while using paired visual data as a query-agnostic trigger to promote retrieval. We first propose a unified framework that introduces imperceptible perturbations to visual inputs to manipulate retrieval probabilities. Besides, due to the prior medical knowledge in LLMs, naively poisoned medical content with explicit factual errors can be corrected during generation. Thus, we leverage the inherent ambiguity of medical diagnosis and design a covert misinformation injection strategy that degrades diagnostic accuracy while evading model self-correction. Experiments on five LLMs and datasets demonstrate that M\textsuperscript{3}Att consistently produces clinically plausible yet incorrect generations. Codes: https://github.com/ypr17/M3Att.
comment: Accepted by ACL 2026
☆ SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.
☆ When Normality Shifts: Risk-Aware Test-Time Adaptation for Unsupervised Tabular Anomaly Detection
Unsupervised tabular anomaly detection methods typically learn feature patterns from normal samples during training and subsequently identify samples that deviate from these patterns as anomalies during testing. However, in practical scenarios, the limited scale and diversity of training data often lead to an incomplete characterization of normal patterns. While test-time adaptation offers a remedy, its isolated focus on test-time optimization ignores the critical synergy with training-phase learning. Furthermore, indiscriminate adaptation to unlabeled test data inevitably triggers anomaly contamination, preventing the model from fully realizing its discriminative capability between normal and anomalous samples. To address these issues, we propose RTTAD, a Risk-aware Test-time adaptation method for unsupervised Tabular Anomaly Detection. RTTAD holistically tackles normality shifts via a synergistic two-stage mechanism. During training, collaborative dual-task learning captures multi-level representations to establish a robust normal prior. During testing, a Test-Time Contrastive Learning (TTCL) module explicitly accounts for adaptation risk by selectively updating the model using high-confidence pseudo-normal samples while constraining anomalous ones. Additionally, TTCL incorporates a k-nearest neighbor-based contrastive objective to refine embedding distributions, thereby further enhancing the model's discriminative capacity. Extensive experiments on 15 tabular datasets demonstrate that RTTAD achieves state-of-the-art overall detection performance.
comment: 13 pages, 6 figures
☆ When Does Non-Uniform Replay Matter in Reinforcement Learning?
Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
☆ Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
Current AI-powered research systems adopt a direct search-then-summarize paradigm that treats hypotheses as end products of scientific discovery. We argue this leaves a critical gap: hypotheses can serve a far more powerful role as organizational instruments that structure the research process itself. We propose the Hypothesis-Driven Deep Research (HDRI) methodology - the first framework using hypotheses to organize general-purpose deep research across arbitrary domains, rather than merely validating claims within specific domains. This transforms research from reactive information retrieval into proactive, verifiable, and iterative knowledge discovery. HDRI is formalized with six core principles and an eight-stage pipeline. A central innovation is the gap-driven iterative research mechanism - a closed-loop quality assurance system that automatically identifies informational and logical gaps, triggering targeted supplementary investigation. We further introduce a fact reasoning framework with traceable reasoning chains and quantified confidence propagation, a subject locking mechanism to prevent entity confusion, and a multi-dimensional quality assessment scheme. The methodology is realized in the INFOMINER system. Experiments demonstrate improvements of 22.4% in fact density, 90% subject matching accuracy, 0.92 multi-source verification confidence, and 14% completeness gain from gap-driven supplementation. Five case studies validate its practical applicability, achieving an average quality rating of 4.46/5.0.
☆ Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution
Current large language model agent frameworks prioritize autonomy but lack the governability mechanisms required for enterprise deployment. High-risk write operations proceed without independent review, complex tasks lack acceptance verification, and computational resources are allocated uniformly regardless of risk level. We propose the Dynamic Tiered AgentRunner, a controlled execution protocol distilled from a production-grade multi-tenant SaaS platform. The framework introduces three core mechanisms: (1) Risk-Adaptive Tiering that dynamically allocates computational resources and review intensity based on task risk profiles, achieving Pareto-optimal trade-offs between safety and efficiency; (2) Separation of Powers architecture where proposal, review, execution, and verification are performed by independent agents with physically isolated boundaries; and (3) Resilience-by-Design through a Verifier-Recovery closed loop that treats failure as a first-class system state. We formalize the tier selectio
comment: 9 pages, 2 figures, 3 tables
☆ To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification
Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in in first-person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.
comment: Accepted to The First Workshop on Artificial Intelligence & Open Government at the 21st International Conference on Artificial Intelligence and Law (ICAIL), June 8, 2026, Singapore
☆ HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: ``where to manipulate'' (contact point localization) and ``how to manipulate'' (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple initial grasp from complex interaction execution. First, Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly-variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31\% performance improvement in simulation tasks with broad type setting, alongside a 36.7\% gain across four real-world tasks with different interaction types.
☆ Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
Erasing specific concepts from text-to-image diffusion models is essential for avoiding the generation of copyrighted and explicit content. Closed-form concept erasure methods offer a fast alternative to backpropagation-based techniques, but they become less effective when scaling from smaller models such as Stable Diffusion 1.5 to larger models like Stable Diffusion XL. To maintain erasure effectiveness in these larger-scale architectures, we propose SParse cross-Attention-based Concept Erasure (SPACE). SPACE iteratively modifies the cross-attention parameters of a model with a closed-form update that jointly induces sparsity and erases target concepts. By concentrating the concept mapping to a lower-dimensional subspace, SPACE achieves superior erasure efficacy compared to dense baselines. Extensive experimental results show improvements in erasure effectiveness and robustness against adversarial prompts. Furthermore, SPACE achieves 80\%-90\% cross-attention sparsity, reducing the storage requirements for saving the modified parameters by 70\%, demonstrating its memory efficiency.
☆ TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
comment: work in progress
☆ ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
Designing proteins with desired functions or properties represents a core goal in synthetic biology and drug discovery. Recent advances in protein language models (PLMs) have enabled the generation of highly designable protein sequences, while preference alignment provides a promising way to steer designs toward desired functions and properties. Nevertheless, they often trigger catastrophic forgetting of pretrained knowledge, degrading basic designability and failing to balance multiple competing objectives. To address these issues, we draw inspiration from On-Policy Distillation (OPD), an advanced post-training method renowned for mitigating catastrophic forgetting through its mode-seeking nature. In this work, we propose ProteinOPD, a multi-objective preference alignment framework that can effectively balance multiple preference objectives while maintaining the inherent designability of PLMs. ProteinOPD adapts a pretrained PLM into preference-specific teachers and distills their knowledge into a shared student via token-level OPD on the student's own trajectories. During this process, the student is aligned to a unique normalized geometric consensus of weighted teachers while ensuring bounded optimization under conflicts. This bridges the gap for OPD in multi-objective/teacher alignment. Extensive experiments show that ProteinOPD achieves substantial gains on target preference objectives without compromising the designability, with an 8x training speedup over RL-based alignment competitors.
☆ LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
comment: Preprint. 23 pages including references and appendices
☆ DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors
Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
comment: 6 pages, 8 figures
☆ Developing a foundation model for high-resolution remote sensing data of the Netherlands
We develop a foundation model using 1.2m high resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both low- and high-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing the model to exploit the temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data, achieving competitive performance on global benchmarks while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.
comment: 9 pages, 4 figures, under review in a journal
☆ A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection IEEE
Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.
comment: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore
☆ One-Step Graph-Structured Neural Flows for Irregular Multivariate Time Series Classification
Neural Flows efficiently model irregular multivariate time series by directly learning ODE solution trajectories with neural networks, bypassing step-by-step numerical solvers. Despite their efficiency, many existing approaches treat variables independently, leaving inter-variable interactions underexplored. Moreover, their one-step mapping makes interaction modeling inherently challenging, as it removes the iterative refinement of interactions during learning. To address this challenge, we propose one-step Graph-Structured Neural Flows (GSNF), which introduce two auxiliary-trajectory self-supervision strategies to strengthen interaction learning: (i) interaction-aware trajectory generation via re-initialization, which induces trajectory divergence to expose graph-induced interactions, with a theoretically derived lower bound on divergence; and (ii) reverse-time trajectory generation, which enforces forward-backward consistency to regularize graph learning, enabled by flow invertibility. Experiments on five real-world datasets show that GSNF achieves state-of-the-art classification performance with highly competitive training time and memory usage.
☆ MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning
Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, RGB images and LiDAR point clouds are fused using a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying densities (20-60 background vehicles) show that MTA-RL consistently outperforms state-of-the-art baselines. Trained solely on Town03, our method demonstrates superior zero-shot generalization in unseen towns, achieving up to a 9.0% increase in Route Completion, an 11.0% increase in Total Distance, and an 83.7% improvement in Distance Per Violation. Furthermore, ablation studies confirm that our multi-modal fusion and reward shaping are critical, significantly outperforming image-only and unshaped variants, demonstrating the effectiveness of MTA-RL for robust urban autonomous driving.
☆ When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in Large Language Model-Driven Applications
Natural language interfaces to structured databases are becoming increasingly common, largely due to advances in large language models (LLMs) that enable users to query data using conversational input rather than formal query languages such as SQL. While this paradigm significantly improves usability and accessibility, it introduces new security risks, particularly the amplification of SQL injection vulnerabilities through the prompt-to-SQL translation process. Malicious users can exploit these mechanisms by crafting adversarial prompts that manipulate model behavior and generate unsafe queries. In this work, we propose a multi-layered security framework designed to detect and mitigate LLM-mediated SQL injection attacks. The framework integrates a front-end security shield for prompt sanitization, an advanced threat detection model for behavioral and semantic anomaly identification, and a signature-based control layer for known attack patterns. We evaluate the proposed framework under diverse and realistic attack scenarios, including prompt injection, obfuscated SQL payloads, and context-manipulation attacks. To ensure robustness, we generate and curate a comprehensive benchmark dataset of adversarial prompts and assess performance across a fine-tuned LLM configuration. Experimental results demonstrate that the proposed approach achieves high detection accuracy while maintaining low false-positive rates, significantly improving the secure deployment of LLM-powered database applications.
comment: 11 pages
☆ When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews ACL 2026
Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.
comment: accepted at ACL 2026
☆ Automated Approach for Solving Infinite-state Polynomial Reachability Games
Reachability games are two-player games played on a graph, where the objective of $\texttt{REACH}$ player is to reach the target set whereas the objective of $\texttt{SAFE}$ player is to stay away from the target set. Reachability games have important applications in artificial intelligence and reactive synthesis, and many of these applications give rise to infinite-state reachability games. In this paper, we study turn-based reachability games on infinite-state graphs defined over valuations of a finite set of real variables. We consider the problem of determining the existence of and computing a winning strategy for $\texttt{REACH}$ player. Our contributions are twofold. First, we propose ranking certificates for reachability games, a sound and complete proof rule for proving that $\texttt{REACH}$ player has a winning strategy from the specified initial state. Second, we consider polynomial reachability games, where transitions and objectives are described by polynomial constraints over real variables, and propose a fully automated algorithm for computing a winning strategy for $\texttt{REACH}$ player together with a formal correctness witness in the form of a ranking certificate. The algorithm is sound, semi-complete, and runs in sub-exponential time. Our experiments demonstrate the ability of our method to solve challenging examples from the literature that were out of the reach of existing methods. Specifically, for the classical Cinderella-Stepmother game, we are able to compute an optimal winning strategy for an arbitrary precision parameter for the first time.
☆ Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation IEEE
Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
comment: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore
☆ Coarsening Linear Non-Gaussian Causal Models with Cycles
Recent work on causal abstraction, in particular graphical approaches focusing on causal structure between clusters of variables, aims to summarize a high-dimensional causal structure in terms of a low-dimensional one. Existing methods for learning such summaries from data assume that both the high- and low-dimensional structures are acyclic, which is helpful for causal effect identification and reasoning but excludes many high-dimensional models and thus limits applicability. We show that in the linear non-Gaussian (LiNG) setting, the high-dimensional acyclicity assumption can be relaxed while still allowing recovery of a low-dimensional causal directed acyclic graph (DAG). We further connect identifiability of this low-dimensional DAG to existing results: LiNG models with cycles are observationally identifiable only up to an equivalence class whose members differ by reversals of directed cycles; our low-dimensional DAG, which is invariant across all members of a given equivalence class, thus forms a natural representative of the class. While existing approaches for learning this observational equivalence class over high-dimensional variables have exponential time complexity, our low-dimensional summary is learned in worst-case cubic time and comes with explicit bounds on the sample complexity. We provide open source code and experiments on synthetic data to corroborate our theoretical results.
☆ Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing
Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.
☆ Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
comment: 28 pages, 8 figures, 8 tables
☆ FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models
Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce \textbf{FormalRewardBench}, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8\%) while specialized theorem provers perform the worst (24.4\%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release \textbf{FormalRewardBench} publicly to encourage more research on developing reward models in formal mathematics.
☆ Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q and A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches are not adequately capturing human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q and A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.
☆ Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver
Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.
☆ Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces
Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-stablished neurophysiological descriptions of the P300. Experimental results show a 9\% improvement in performance over state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.
☆ MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image--caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
comment: 29 pages, 14 figures
☆ Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving
Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
☆ CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients
Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at https://github.com/wxk1224/CFSPMNet.
☆ Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring
Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane achieves a reduction of up to 76.2% in the assertion count while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a speedup of 2.6x to 6.1x speedup in simulation time. The proposed framework is released at https://anonymous.4open.science/r/Arcane1-0A6F/.
comment: 6 pages, 6 figures
☆ ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a plug-and-play flexible paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost along with heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by up to a 15.6% and 28.9% absolute margin respectively.
☆ HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation
We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
☆ Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.
☆ PoDAR: Power-Disentangled Audio Representation for Generative Modeling
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a $2\times$ acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.
comment: 9 pages, 3 figures
☆ Active Testing of Large Language Models via Approximate Neyman Allocation
Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28\% MSE reduction over Uniform Sampling and an average of 22.9\% budget savings.
☆ Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization ICML 2026
Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
☆ NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. In order to address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at https://github.com/hyundong98/NCO-Decoding.git .
☆ MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.
comment: 25 pages, 3 figures
☆ Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear
Futrell and Mahowald (2025) frame the success of neural language models (LMs) as supporting gradient, usage-based linguistic theories. I argue that LMs can also instantiate theories based on formal structures - the types of theories seen in the generative tradition. This argument expands the space of theories that can be tested with LMs, potentially enabling reconciliations between usage-based and generative accounts.
comment: Accepted to Behavioral and Brain Sciences; 4 pages; Commentary on "How Linguistics Learned to Stop Worrying and Love the Language Models" by Richard Futrell and Kyle Mahowald
☆ Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust
Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.
☆ Route by State, Recover from Trace: STAR with Failure-Aware Markov Routing for Multi-Agent Spatiotemporal Reasoning
Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STARis an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool--query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract--compute--deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
comment: 30 pages, 13 figures
☆ Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering
As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent \textbf{Coordination Engineering}, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose \textbf{Swarm Skills}, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification's evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.
☆ Guided Streaming Stochastic Interpolant Policy
Inference-time guidance is essential for steering generative robot policies toward dynamic objectives without retraining, yet existing methods are largely confined to chunk-based architectures that exhibit high latency and lack the reactivity needed for test-time preference alignment or obstacle avoidance. In this work, we formally derive the optimal guidance term for Stochastic Interpolants (SI) by analyzing the value function's time evolution via the Backward Kolmogorov Equation, establishing a modified drift that theoretically guarantees sampling from a target distribution. We apply this framework to real-time control through the Streaming Stochastic Interpolant Policy (SSIP), which generalizes the deterministic Streaming Flow Policy (SFP). Unifying this guidance law with the streaming architecture enables fast and reactive control. To support diverse deployment needs, we propose two complementary mechanisms: training-free Stochastic Trajectory Ensemble Guidance (STEG) that computes gradients on-the-fly for zero-shot adaptation, and training-based Conditional Critic Guidance (CCG) for amortized inference. Empirical evaluations demonstrate that our guided streaming approach significantly outperforms conventional chunk-based policies in reactivity and provides superior, physically valid guidance for dynamic, unstructured environments.
comment: Accepted to Robotics: Science and Systems (RSS) 2026. The first two authors contributed equally
☆ Rethinking Loss Reweighting for Imbalance Learning as an Inverse Problem: A Neural Collapse Point of View ICML2026
Loss reweighting is a widely used strategy for long-tailed classification, but existing reweighting strategies often rely on heuristics and rarely define a well-specified target. Inspired by Neural Collapse (NC), the ideal simplex Equiangular Tight Frame (ETF) terminal geometry suggests equal per-class average loss as a reasonable target for reweighting. Based on the ideal equal loss objective, we consider loss reweighting as an inverse problem and propose an inverse-view reweighting strategy that infers class weights dynamically to match this ideal objective. Empirically, NC metrics suggest our method can effectively reduce the loss imbalance coefficient and closer alignment with NC geometry while consistently outperforming strong long-tailed baselines on different datasets.
comment: Accepted by ICML2026
☆ Adaptive Action Chunking via Multi-Chunk Q Value Estimation
Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
☆ Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework ACL 2026
Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users' data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting ``positive bias'', ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.
comment: Accepted by ACL 2026 Main
☆ TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.
comment: Under review
☆ Bridging the Cognitive Gap: A Unified Memory Paradigm for 6G Agentic AI-RAN IEEE
As 6G evolves, the radio access network must transcend traditional automation to embrace agentic AI capable of perception, reasoning, and evolution. A fundamental cognitive gap persists in current disaggregated architectures, where interfaces force the physical layer to compress high-dimensional states into low-dimensional metrics, trapping reasoning agents behind a semantic bottleneck. This article envisions a shift from interface-bound to memory-centric architectures. We propose a unified memory paradigm that dissolves the boundaries between sensing and reasoning by mapping biological memory hierarchies onto heterogeneous computing fabrics. Enabled by emerging coherent interconnects, this approach creates a cognitive continuum where microsecond-level reflexes, millisecond-level reasoning, and long-term evolution share state across time scales. By replacing message passing with zero-copy observability, we empower AI agents to bridge the gap between real-time responsiveness and long-horizon context for truly autonomous 6G networks.
comment: This work has been submitted to the IEEE for possible publication
☆ From Single-Step Edit Response to Multi-Step Molecular Optimization
Conditional molecular optimization aims to edit a molecule to realize a specified property shift. In practice, structurally similar molecule data is scarce, while decisions are inherently action-level: at each step, the system must select one local structural edit from a candidate set that is strictly filtered by chemical feasibility rules. This level mismatch between supervision and decision makes oracle-in-the-loop search unstable in molecular optimization. Regressing on property differences between molecule pairs improves data efficiency but relies on oracle-in-the-loop search, entangling transformation effects with global context and providing limited guidance for selecting the next feasible edit, often resorting to oracle-in-the-loop search. For this reason, we propose a response-oriented discrete edit optimization approach comprising two tightly coupled components: a single-step molecular edit response predictor (SMER) and a multi-step planner that composes local predictions into optimization trajectories via guided tree search (SMER-Opt). The approach learns a directional evaluation model over edit actions to support constraint-aware planning. It mines weakly related molecule pairs and decomposes their structural differences into minimal edit units, turning endpoint property annotations into process-level supervision and yielding reusable, transferable action primitives. A directional edit evaluator then scores feasible candidate edits by their likelihood of moving the molecule toward the desired property change, substantially reducing dependence on external evaluator queries at decision time. Code is available at https://anonymous.4open.science/r/SMER.
♻ ☆ Distributionally Robust Token Optimization in RLHF
Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO enhances consistency under distribution shifts in multiple reasoning benchmarks among different tasks, achieving $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
♻ ☆ TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System
Multi-round LLM-based multi-agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents' roles may change \textit{across rounds} due to dynamic adversary, task progression, or time-varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a \textbf{t}ask-\textbf{o}riented \textbf{dy}namic \textbf{comm}unication algorithm. It produces behavior-driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that, under both dynamic adversarial settings and communication budget constraints, TodyComm achieves superior task performance while maintaining token efficiency, scalability, and strong generalizability across varying adversarial conditions.
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 3 popular agent harnesses and 5 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 45.1%.
comment: 29 pages, 16 figures
♻ ☆ Reinforcement Learning with Action Chunking NeurIPS 2025
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); 29 pages, 17 figures
♻ ☆ Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments
Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches to some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.
♻ ☆ AI Alignment via Incentives and Correction
We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.
♻ ☆ VeRO: An Evaluation Harness for Agents to Optimize Agents ICML
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
comment: Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026
♻ ☆ The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
comment: 35 pages, 15 figures v2: minor layout fixes and author list update
♻ ☆ Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference ICML 2026
Modern large language models (LLMs) increasingly depends on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bounded operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is up to $2.2\times$ faster and achieves up to $4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
comment: Accepted by ICML 2026
♻ ☆ SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving
Cooperative driving is a safety- and efficiency-critical task that requires the coordination of diverse, interaction-realistic multi-agent trajectories. Although existing diffusion-based methods can capture multimodal behaviors from demonstrations, they often exhibit weak scene consistency and poor alignment with closed-loop cooperative objectives. This makes post-training necessary for further improvement, yet achieving stable online post-training in reactive multi-agent environments remains challenging. In this paper, we propose SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention provides direct scene-information injection, while AdaLN-Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly integrates the reverse denoising chain with policy-environment interaction. We further co-design dense, well-shaped planning rewards and variance-gated group-relative policy optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on WOMD, with 10.47%-28.26% and 1.70%-7.22% improvements in core safety and efficiency metrics, respectively. Moreover, compared with alternative post-training methods, SCORP delivers significant and consistent gains in both driving safety and traffic efficiency, highlighting stable and sustained advances in closed-loop cooperative driving.
♻ ☆ ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. To address these vulnerabilities, we introduce \textsc{ClawGuard}, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, \textsc{ClawGuard} blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on six injection benchmarks covering web, local, MCP, and skill channels, as well as three utility benchmarks covering OS, web, and code tasks, demonstrate that \textsc{ClawGuard} achieves robust protection against indirect prompt injection without compromising agent utility or introducing significant token overhead. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems. Code is publicly available at github.com/Claw-Guard/ClawGuard/.
♻ ☆ AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
♻ ☆ OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition
This paper presents OpenCLAW-P2P v7.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on the v6.0 foundations -- multi-layer persistence, live reference verification, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces mathematical corrections to the theoretical framework, ensuring dimensional consistency, proper range constraints, and unambiguous notation throughout. Additionally, this edition documents significant ecosystem expansions including the CAJAL family of open-source language models (4B and 9B parameters) fine-tuned for scientific paper generation. The four major subsystems introduced in v6.0 are retained: (i) a Multi-Layer Paper Persistence Architecture with four storage tiers ensuring zero paper loss; (ii) a Multi-Layer Retrieval Cascade reducing latency from >3s to <50ms; (iii) a Live Reference Verification system detecting fabricated citations with >85% accuracy; and (iv) a Scientific API Proxy providing access to seven public scientific databases. Mathematical corrections in v7.0 include: corrected fixed-point condition in the Sufficient Reason theorem; dimensionally consistent progress-rate indicator; fully specified reputation update formula incorporating quality terms q0 and q-bar; clarified attention-logit bound in the AETHER pruning theorem; explicit range documentation for the calibration mapping; non-negativity guarantee for the depth score; discrete-time notation for the PD Governor; and explicit parameter definitions for the HSR weight formula.
comment: v7.0: Mathematical corrections (fixed-point condition Eq.4, dimensionally consistent tau-indicator Eq.7, fully specified reputation formula Eq.8 with quality terms q0 and q-bar, discrete-time PD Governor Eq.15, HSR parameter definitions Eq.16); ecosystem developments: CAJAL-4B/9B models, BenchClaw platform, 14 integrations. 36 pages
♻ ☆ From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
Large Language Model-based Multi-Agent Systems (LLM-MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system-level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through message dependencies. Existing protections often rely on single-agent validation or require modifications to the collaboration architecture, which can weaken effective information flow and may not align with natural collaboration processes in real tasks. To address this, we propose a propagation dynamics model tailored for LLM-MAS that abstracts collaboration as a directed dependency graph and provides an early-stage risk criterion to characterize amplification risk. Through experiments on six mainstream frameworks, we identify three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. We further instantiate an attack where injecting just a single atomic error seed leads to widespread failure. In response, we introduce a genealogy-graph-based governance layer, implemented as a message-layer plugin, that suppresses both endogenous and exogenous error amplification without altering the collaboration architecture. Experiments show that this approach prevents final infection in at least 89% of runs across operating modes and significantly mitigates the cascading spread of minor errors.
♻ ☆ ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts ACL 2026
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
comment: 21 pages, 17 figures, accepted to ACL 2026: the 4th Workshop on Advances in Language and Vision Research
♻ ☆ Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of $26\%$ against compaction, $130\%$ against CodeAct with sub-calls, and $13\%$ against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
comment: 9 pages, 43 with Appendix
♻ ☆ SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation SP
We present a visual-context image-retrieval-augmented generation (ImageRAG)- assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR) imagery. SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples of known true target types, our SAR-RAG system can compare similar vehicle categories, thereby improving ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.
comment: Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI
♻ ☆ Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels
The growing complexity of AI systems has intensified the need for transparency through Explainable AI (XAI). Counterfactual explanations (CFs) offer actionable "what-if" scenarios on three levels: Local CFs providing instance-specific insights, Global CFs addressing broader trends, and Group-wise CFs (GWCFs) striking a balance and revealing patterns within cohesive groups. Despite the availability of methods for each granularity level, the field lacks a unified method that integrates these complementary approaches. We address this limitation by proposing a gradient-based optimization method for differentiable models that generates Local, Global, and Group-wise Counterfactual Explanations in a unified manner. We especially enhance GWCF generation by combining instance grouping and counterfactual generation into a single efficient process, replacing traditional two-step methods. Moreover, to ensure trustworthiness, we innovatively introduce the integration of plausibility criteria into the GWCF domain, making explanations both valid and realistic. Our results demonstrate the method's effectiveness in balancing validity, proximity, and plausibility while optimizing group granularity, with practical utility validated through practical use cases.
♻ ☆ MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasonning Models
Recent reasoning-based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein-pocket targets, resulting in 50k multi-objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption-based generation or molecule-editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general-purpose and chemistry-specialized open-source LLMs and introduce a diversity-aware top-k metric to measure whether models can generate a diverse set of high-scoring molecules. Finally, we use the verifier to fine-tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity-exploitation trade-off. MolRGen provides a scalable testbed for studying verifier-based reasoning and reinforcement learning in molecular design.
♻ ☆ AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies. Code and Videos available at https://arvla.insait.ai
comment: RSS 2026 accepted
♻ ☆ What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering ICLR 2026
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
comment: 41 pages, 34 figures, Accepted at ICLR 2026, Code available at https://github.com/Jim-Maar/implicit-planning-in-llms
♻ ☆ Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models
Theories of democratic stability, populism, and party-system crisis often point to a form of polarization that comparative research rarely measures directly: hostile relations among political elites. Existing comparative measures capture adjacent phenomena, including mass affective polarization, or elite ideological distance, but not directed mutual elite evaluation. This paper introduces the Elite Polarization Score, a measurement of out-party evaluations in parliamentary speech. Large Language Models identify political actors mentioned in parliamentary debates, recover speaker-target pairs, estimate the sentiment directed at each actor, standardize heterogeneous references into party dyads, and aggregate these evaluations into party- and parliament-level measures of mutual out-party negativity. The validity of the approach is demonstrated on parliamentary corpora from the United Kingdom, Hungary, and Italy, covering up to four decades of debate. The resulting measure is conceptually distinct from mass affective polarization, elite ideological polarization, incivility, negative campaigning, and general sentiment. Evidence from the UK case study shows that it is also empirically distinct from mass affective polarization, elite ideological polarization, and incivility. Extreme negative evaluations can also be used to locate pernicious polarization rhetoric. Validation across three countries finds no false discoveries, sentiment estimates accurate to roughly 10 percent of the scale range, and AI sensitivity that meets or exceeds that of human coders in two of three settings. Because the algorithm is multilingual, requires no task-specific training, and can be aggregated by party and quarter, it provides a scalable basis for future cross-national research on what produces elite polarization and what elite polarization itself produces
♻ ☆ MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks ICML 2026
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
comment: Accepted to ICML 2026
♻ ☆ TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch), explicit unfused baselines (torch_sdpa_math, standard eager attention), and forced backend probes (FlashAttention2, EffecientAttention, CuDNN Attention) across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
♻ ☆ Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the \textit{inference-time scaling wall}. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce \textbf{Elastic Mixture-of-Experts (EMoE)}, a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B--21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3$\times$ the training-time $k$, while also achieving higher peak performance.
♻ ☆ Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing. In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
comment: Accepted in Transactions on Machine Learning Research (2026)
♻ ☆ AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipeline that bottlenecks large-scale studies, (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
♻ ☆ Explaining Graph Neural Networks for Node Similarity on Graphs
Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively approached from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable similarity search over graphs, by investigating how GNN-based methods for computing node similarities can be augmented with explanations. Specifically, we evaluate the performance of two prominent approaches towards explanations in GNNs, based on the concepts of mutual information (MI), and gradient-based explanations (GB). We discuss their suitability and empirically validate the properties of their explanations over different popular graph benchmarks. We find that unlike MI explanations, gradient-based explanations have three desirable properties. First, they are actionable: selecting inputs depending on them results in predictable changes in similarity scores. Second, they are consistent: the effect of selecting certain inputs overlaps very little with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores.
comment: Accepted in Transactions of Machine Learning Research (2026)
♻ ☆ Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
♻ ☆ Automated conjecturing with \emph{TxGraffiti}
\emph{TxGraffiti} is a data-driven, heuristic-based computer program developed to automate the process of generating conjectures across various mathematical domains. Since its creation in 2017, \emph{TxGraffiti} has contributed to numerous mathematical publications, particularly in graph theory. In this paper, we present the design and core principles of \emph{TxGraffiti}, including its roots in the original \emph{Graffiti} program, which pioneered the automation of mathematical conjecturing. We describe the data collection process, the generation of plausible conjectures, and methods such as the \emph{Dalmatian} heuristic for filtering out redundant or transitive conjectures. Additionally, we highlight its contributions to the mathematical literature and introduce a new web-based interface that allows users to explore conjectures interactively. While we focus on graph theory, the techniques demonstrated extend to other areas of mathematics.
comment: Annals of Mathematics and Artificial Intelligence (2026)
♻ ☆ Towards Neuro-symbolic Causal Rule Synthesis, Verification, and Evaluation Grounded in Legal and Safety Principles
Rule-based systems remain central in safety-critical domains but often struggle with scalability, brittleness, and goal misspecification. These limitations can lead to reward hacking and failures in formal verification, as AI systems tend to optimize for narrow objectives. In previous research, we developed a neuro-symbolic causal framework that integrates first-order logic abduction trees, structural causal models, and deep reinforcement learning within a MAPE-K loop to provide explainable adaptations under distribution shifts. In this paper, we extend that framework by introducing a meta-level layer designed to mitigate goal misspecification and support scalable rule maintenance. This layer consists of a Goal/Rule Synthesizer and a Rule Verification Engine, which iteratively refine a formal rule theory from high-level natural-language goals and principles provided by human experts. The synthesis pipeline employs large language models (LLMs) to: (1) decompose goals into candidate causes, (2) consolidate semantics to remove redundancies, (3) translate them into candidate first-order rules, and (4) compose necessary and sufficient causal sets. The verification pipeline then performs (1) syntax and schema validation, (2) logical consistency analysis, and (3) safety and invariant checks before integrating verified rules into the knowledge base. We evaluated our approach with a proof-of-concept implementation in two autonomous driving scenarios. Results indicate that, given human-specified goals and principles, the pipeline can successfully derive minimal necessary and sufficient rule sets and formalize them as logical constraints. These findings suggest that the pipeline supports incremental, modular, and traceable rule synthesis grounded in established legal and safety principles.
♻ ☆ GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of ``jailbreaks'' to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.
comment: 56 pages
♻ ☆ Inductive Entity Representations from Text via Link Prediction
Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to preform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model, that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited KG-specific tasks, and have greater generalization properties than evaluated in previous work.
comment: The Web Conference 2021
♻ ☆ Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model
Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.
♻ ☆ Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms
Convergence of Q-learning has been the subject of extensive study for decades. Among the available techniques, the ordinary differential equation (ODE) method is particularly appealing as a general-purpose, off-the-shelf tool for sanity-checking the convergence of a wide range of reinforcement learning algorithms. In this paper, we develop a unified ODE-based convergence framework that applies to standard Q-learning and several soft/smoothed variants, including those built on the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. Our analysis uses a smooth p-norm Lyapunov function, leading to concise yet rigorous stability arguments and circumventing the non-smoothness issues inherent to classical infty-norm-based approaches. To the best of our knowledge, the proposed framework is among the first to provide a unified ODE-based treatment that is broadly applicable to smooth Q-learning algorithms while also encompassing standard Q-learning. Moreover, it remains valid even in settings where the associated Bellman operator is not a contraction, as may happen in Boltzmann soft Q-learning.
♻ ☆ One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $π_0$).
♻ ☆ CARL: Criticality-Aware Agentic Reinforcement Learning
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each step holds equal contribution, which deviates significantly from reality. Our analysis reveals that only the action choices on a small fraction of states are critical in determining the final outcome. Building on this insight, we propose CARL, a criticality-aware reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for state criticality and achieves focused training by assigning rewards to actions taken from high-criticality states while excluding actions taken from low-criticality states from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
comment: 18 pages, 6 figures
♻ ☆ Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (GLM-ES) for generalized linear bandits and Neural Ensemble Sampling (Neural-ES) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\widetilde{O}(d^{3/2} \sqrt{T} + d^{4})$ for GLM-ES and $\widetilde{O}(\widetilde{d}^{3/2} \sqrt{T})$ for Neural-ES, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel (NTK) matrix and $T$ is the number of rounds. The regret bound of GLM-ES matches the state-of-the-art result of randomized exploration algorithms in generalized linear bandit setting. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate GLM-ES, Neural-ES and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
comment: 58 pages, 5 figures, 1 table
♻ ☆ ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose Predictive Alignment and Planning Architecture, a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.
♻ ☆ MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics
While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like \LaTeX{} or Maple. To address this, we introduce MathlibLemma, a modular LLM-based pipeline for automated folklore-lemma mining: the discovery, formalization, and proving of reusable intermediate facts that mathematicians often take for granted but that are not always present in formal libraries. At its core, MathlibLemma proactively mines the missing connective tissue of mathematics. The pipeline produces a verified library of folklore-style lemmas, including 1,506 Lean-checked proofs that pass a proof-bypass screen; a small curated pilot subset has also been merged into Mathlib, providing external evidence that selected outputs can meet expert library standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 non-trivial type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work takes a step toward AI-assisted expansion of formal mathematical libraries.
♻ ☆ A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning
Behavior cloning (BC) policies on position-controlled robots inherit the closed-loop response of the underlying PD controller, yet the nonasymptotic finite-horizon consequences of controller gains for BC failure remain open. We show that independent sub-Gaussian action errors propagate through the gain-dependent closed-loop dynamics to yield sub-Gaussian position errors whose proxy matrix $X_\infty(K)$ governs the failure tail. The probability of horizon-$T$ task failure factorizes into a gain-dependent amplification index $Γ_T(K)$ and the validation loss plus a generalization slack, so training loss alone cannot predict closed-loop performance. Under shape-preserving upper-bound structural assumptions, the proxy admits the scalar bound $X_\infty(K)\preceqΨ(K)\bar X$, with $Ψ(K)$ decomposed into label difficulty, injection strength, and contraction. This ranks the four canonical regimes with compliant-overdamped (CO) tightest, stiff-underdamped (SU) loosest, and the stiff-overdamped versus compliant-underdamped ordering system-dependent. For the canonical scalar second-order PD system, the closed-form continuous-time stationary variance $X_\infty^{\mathrm{c}}(α,β)=σ^2α/(2β)$ is strictly monotone in stiffness and damping over the entire stable orthant, covering both underdamped and overdamped regimes, and the exact zero-order-hold (ZOH) discretization inherits this monotonicity. The analysis gives a nonasymptotic finite-horizon extension of the gain-dependent error-attenuation explanation of Bronars et al.
♻ ☆ CAP: Controllable Alignment Prompting for Unlearning in LLMs ACL 2026
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
comment: Accpeted to ACL 2026 Main Conference
♻ ☆ How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients ACL2026
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
comment: ACL2026, camera-ready
♻ ☆ Causal Discovery for Irregularly Time Series with Consistency Guarantees
This paper studies causal discovery in irregularly sampled time series-a key challenge in risk-sensitive domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. The main challenge comes from the interdependence between missing data imputation and causal structure recovery: errors in imputation and structure learning can reinforce each other, leading to an inaccurate causal graph. Existing methods either impute first and then discover, or jointly optimize both via neural representation learning, but lack explicit mechanisms to ensure mutual consistency of imputation and structure learning. We address this challenge with ReTimeCausal, an EM-based framework that alternates between imputation and structure learning, which encourages structural consistency throughout the optimization process. Our framework provides theoretical consistency guarantees for structure recovery and extends classical results to settings with irregular sampling and high missingness. ReTimeCausal combines kernel-based sparse regression and structural constraints in an alternating process that updates the completed data and the causal graph in turn. Experiments on synthetic and real-world datasets show that ReTimeCausal is more effective than existing methods under challenging irregular sampling and missing data.
comment: 12 pages, 2 figures
♻ ☆ Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding ACL 2026
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
comment: Accepted to ACL 2026
♻ ☆ Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias ECML
As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
comment: Accepted at ECML PKDD Bias Workshop '25
♻ ☆ Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation, risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as ``intelligence'', despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits, lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometrics tests, or could be created entirely from scratch to fit the unique context of AI.
♻ ☆ Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments ICML 2026
This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
comment: ICML 2026
♻ ☆ Schoenfeld's Anatomy of Mathematical Reasoning by Language Models ACL2026
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
comment: ACL2026, camera-ready
♻ ☆ HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
comment: Code & Data: https://github.com/DeepExperience/HyperEyes
♻ ☆ WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning
We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubrics criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.
comment: 16 pages, 3 figures, 8 tables
♻ ☆ HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
♻ ☆ LPT: Less-overfitting Prompt Tuning for Vision-Language Model
Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Due to its efficiency, prompt learning has gradually become a more effective and efficient method for transferring VLMs to downstream tasks, surpassing traditional finetuning methods. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.
♻ ☆ General Agent Evaluation ICLR 2026
General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures x 5 backbone LLMs (three closed-source, two open-weight) x 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per-domain customization; (ii) agent architecture choice swings results by up to 12pp within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily-customized domain-specific agents; (iv) open-weight models tested exhibit "generality sinks" absent from frontier closed-source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture-distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.
comment: Presented at the ICLR 2026 Workshop on Agents in the Wild
♻ ☆ MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub-optimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10\%, 13.36\%, and 12.56\% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K). Our source code is available at \href{https://github.com/SunGL001/MePo}{MePo}
♻ ☆ You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models
The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
♻ ☆ Challenges and opportunities for AI to help deliver fusion energy
There is great potential for the application of AI tools in fusion research, and substantial worldwide benefit if fusion power is realised. However, using AI comes with its own challenges, many of which can be mitigated if responsible and robust methodologies are built into existing approaches. To do that requires close, long-term collaborations between fusion domain experts and AI developers and awareness of the fact that not all problems in fusion research are best tackled with AI tools. In April 2025, experts from academia, industry, UKAEA and STFC discussed how AI can be used to advance R&D in fusion energy at the first edition of The Economist FusionFest event. This Perspective is an expanded and updated summary of the round table discussion, providing more context and examples.
comment: Submitted to Plasma Physics and Confined Fusion
♻ ☆ OpenClaw-RL: Train Any Agent Simply by Talking
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
comment: Code: https://github.com/Gen-Verse/OpenClaw-RL
♻ ☆ What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA
What structural inductive bias helps transformers reason over knowledge graphs? Through controlled ablations of a minimal transformer modification with four independently removable components (sparse adjacency masking, edge-type biases, query scaling, value gating), we isolate which structural signals drive multi-hop reasoning. Our finding is sharp: sparse adjacency masking alone accounts for the dominant share of improvement over unmasked transformers (+72.5pp on 3-hop MetaQA, +45.5pp on WebQSP, +53.9pp on CWQ), while learned relation parameters add only modest refinement and can actively hurt without structural guidance. A zero-shot experiment provides architecturally independent corroboration: masking-based attention degrades 4.0x less than relation-specific weights when edge types are held out. The useful inductive bias for multi-hop KGQA is predominantly topological, not relational.
comment: 8 pages + appendix, 9 figures
♻ ☆ DUALFloodGNN: Physics-informed Graph Neural Network for Operational Flood Modeling IJCAI
Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given its flexible input and architecture, GNNs can be leveraged alongside physics-informed techniques with ease, significantly improving interpretability and generalizability. We introduce a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables (e.g., water volume, flow, and depth) while maintaining high computational efficiency. The model is open sourced at https://github.com/acostacos/dual_flood_gnn. The dataset is open sourced at https://hdl.handle.net/2123/35293 with the DOI 10.25910/9xav-0s86.
comment: Accepted for publication at the IJCAI-ECAI 2026 AI4Tech track
♻ ☆ HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics ICLR 2026
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing drama generation methods often produce LLMs that lack initiative and cannot interact with the physical scene, while typically requiring detailed input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi-agent framework focused on drama creation and real-time online performance. Given a simple topic, the framework initially generates a narrative blueprint to guide the subsequent improvisational performance. During online performance, each actor is equipped with an adaptive reasoning module that enables decision-making based on their personas, memories, goals during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.
comment: Accepted to the Fourteenth International Conference on Learning Representations (ICLR 2026)
♻ ☆ Task complexity shapes internal representations and robustness in neural networks
Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes-pruning, binarization, noise injection, sign flipping, and bipartite network randomization-to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase-transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure-instead of precise weight magnitudes-through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.
♻ ☆ PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering ICML 2026
Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.
comment: Accepted By ICML 2026
♻ ☆ Dynamics-Aligned Shared Hypernetworks for Contextual RL under Discontinuous Shifts
Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode arises when latent context discontinuously changes how actions affect the environment, requiring incompatible control responses across contexts. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to discontinuous context-to-dynamics shifts, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via expressivity separation results for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under non-overlapping contexts. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate challenging context-to-dynamics interactions, including actuator inversion, actuator permutations, and weakly non-overlapping continuous dynamics. On AIB's held-out tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 58.1% and surpassing a standard context-aware baseline by 11.5% on average.
♻ ☆ SLR: Automated Synthesis for Scalable Logical Reasoning
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
♻ ☆ Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect
AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows,self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from coderepresentation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that ingovernedintelligent systems, this transition is an authority amplification: it converts symbolic structure into executableauthority andmust be mediated like any other effect. We present governed metaprogramming, a language design where programrepresentations(machine forms) are first-class values, form manipulation is pure computation, and materialization (the transition fromform toexecutable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposedprogram'scapability requirements, policy compliance, and resource estimates before permitting execution. We formalize twojudgments: pureform evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). Weprovethree properties: purity of form manipulation, the no-bypass theorem, and boundary preservation. We implement the designinMashinTalk, a DSL for AI workflows compiling to BEAM bytecode, and report on integration with 454 existingmachine-checked Rocqtheorems. The central contribution is reclassifying eval from a language primitive into a governed effect.
comment: 15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Update: Title typo fix
♻ ☆ Context Learning for Multi-Agent Discussion
Multi-Agent Discussion (MAD) has garnered increasing attention very recently, where multiple LLM instances collaboratively solve problems via structured discussion. However, we find that current MAD methods easily suffer from discussion inconsistency, LLMs fail to reach a coherent solution, due to the misalignment between their individual contexts.In this paper, we introduce a multi-LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL train the generators to control context coherence and output discrepancies via a carefully crafted self-adaptive mechanism.It enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that the performance of M2CL significantly surpasses existing methods by 20%--50%, while enjoying favorable transferability and computational efficiency.
♻ ☆ A Physical Theory of Backpropagation: Exact Gradients from the Least-Action Principle
Backpropagation is typically presented as a symbolic procedure: a backward pass topologically distinct from inference, with non-local error signals and synchronous global clocking, features with no clear analog in physical reality. Existing physics-inspired alternatives recover gradients only approximately, in vanishing-perturbation limits, or under weight-symmetry constraints incompatible with feedforward architectures. In this paper, we address this gap by deriving exact backpropagation from Hamilton's least-action principle. By recasting the forward dynamics in continuous time and adapting a Lagrangian formalism for non-conservative systems to the resulting flow, we unify inference and gradient computation within a single variational framework on a doubled phase space, whose two conjugate fields jointly encode activations and sensitivities. A single global Lagrangian governs the dynamics: the task loss enters as a symmetry-breaking perturbation of the forward manifold, and credit assignment emerges as the tension that develops between the conjugate states. Inference and gradient computation thus unfold simultaneously through local interactions, requiring no separate backward circuit. Ultimately, standard backpropagation is recovered exactly as the discrete-time projection of this continuous flow. This perspective unifies the formalism of physics with backpropagation, opening a principled pathway for applying tools from classical mechanics - symplectic geometry, Noether's theorem, path-integral methods - to the analysis of learning dynamics. As a downstream consequence, it also points toward analog and neuromorphic substrates in which learning is embodied in the hardware itself.
comment: 22 pages
♻ ☆ Chinese Cyberbullying Detection: Dataset, Method, and Validation
Existing cyberbullying detection benchmarks were organized by the polarity of speech, such as "offensive" and "non-offensive", which were essentially hate speech detection. However, in the real world, cyberbullying often attracted widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, which consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanations generation as an ensemble method to generate the pseudo labels, and then let human annotators judge these labels. Then we propose the evaluation criteria for validating whether it constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can be a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study for the Chinese cyberbullying incident detection task.
♻ ☆ Agent-Omit: Adaptive Context Omission for Efficient LLM Agents ICML 2026
Managing agent context (e.g., thought and observation) during multi-turn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by KL-divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness-efficiency trade-off than seven efficient LLM agents methods. Our code and data are available at https://github.com/usail-hkust/Agent-Omit.
comment: ICML 2026
♻ ☆ Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction
While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause as \emph{Contextual Inertia}: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce \textbf{R}einforcement \textbf{L}earning with \textbf{S}ingle-\textbf{T}urn \textbf{A}nchors (\textbf{RLSTA}), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications. Code is available at https://github.com/Tencent/RLSTA.
♻ ☆ GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling
In this work we introduce GreenLightningAI (GLAI), a new architectural block designed as an alternative to conventional MLPs. The central idea is to separate two types of knowledge that are usually entangled during training: (i) *structural knowledge*, encoded by the stable activation patterns induced by ReLU activations; and (ii) *quantitative knowledge*, carried by the numerical weights and biases. By fixing the structure once stabilized, GLAI reformulates the MLP as a combination of paths, where only the quantitative component is optimized. This reformulation retains the universal approximation capabilities of MLPs, yet achieves a more efficient training process, reducing training time by ~40% on average across the cases examined in this study. Crucially, GLAI is not just another classifier, but a generic block that can replace MLPs wherever they are used, from supervised heads with frozen backbones to projection layers in self-supervised learning or few-shot classifiers. Across diverse experimental setups, GLAI consistently matches or exceeds the accuracy of MLPs with an equivalent number of parameters, while converging faster. Overall, GLAI establishes a new design principle that opens a direction for future integration into large-scale architectures such as Transformers, where MLP blocks dominate the computational footprint.
comment: 20 pages, 2 figures
♻ ☆ ActivationReasoning: Logical Reasoning in Latent Activation Spaces ICLR 2026
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3)Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
comment: Proceedings of the 14th International Conference on Learning Representations (ICLR 2026)
♻ ☆ A Scalable Entity-Based Framework for Auditing Bias in LLMs
Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying either on artificial prompts that poorly reflect real-world use or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework that uses named entities as controlled probes to measure systematic disparities in model behavior. Synthetic data enables us to construct diverse, controlled inputs, and we show that it reliably reproduces bias patterns observed in natural text, supporting its use for large-scale analysis. Using this framework, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. We find consistent patterns: models penalize right-wing politicians and favor left-wing politicians, prefer Western and wealthier countries over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not mitigate Western-aligned preferences. These findings highlight the need for systematic bias auditing before deploying LLMs in high-stakes applications. Our framework is extensible to other domains and tasks, and we make it publicly available to support future work.
♻ ☆ APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidences. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
♻ ☆ The Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates
Recent application programming interface (API) restrictions on major social media platforms challenge compliance with the EU Digital Services Act [20], which mandates data access for algorithmic transparency. We develop a structured audit framework to assess the growing misalignment between regulatory requirements and platform implementations. Our comparative analysis of X/Twitter, Reddit, TikTok, and Meta identifies critical ``audit blind-spots'' where platform content moderation and algorithmic amplification remain inaccessible to independent verification. Our findings reveal an ``accountability paradox'': as platforms increasingly rely on AI systems, they simultaneously restrict the capacity for independent oversight. We propose targeted policy interventions aligned with the AI Risk Management Framework of the National Institute of Standards and Technology [80], emphasizing federated access models and enhanced regulatory enforcement.
comment: Accepted at ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
♻ ☆ TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mix-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.
♻ ☆ Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control
State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious per-environment hyperparameter tuning. We further investigate the batch-to-streaming transition, showing that a naive transition does not guarantee preservation of pre-trained policy performance, and propose a principled approach to address this challenge.
♻ ☆ Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
comment: 23 pages, 8 figures. Preprint
♻ ☆ Equivariant Volumetric Grasping
We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sampling efficiency. Our model employs a tri-plane volumetric feature representation -- i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are \emph{equivariant} to $90^\circ$ rotations, while the \emph{sum} of features from the other two planes remains \emph{invariant} to reflections induced by the same transformations. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance within a real-time cost constraint. Video and code can be viewed in: https://mousecpn.github.io/evg-page/
comment: 21 pages
♻ ☆ Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
Watermarking enables GenAI providers to verify whether content was generated by their models. A watermark is a hidden signal in the content, whose presence can be detected using a secret watermark key. A core security threat are forgery attacks, where adversaries insert the provider's watermark into content \emph{not} produced by the provider, potentially damaging their reputation and undermining trust. Existing defenses resist forgery by embedding many watermarks with multiple keys into the same content, which can degrade model utility. However, forgery remains a threat when attackers can collect sufficiently many watermarked samples. We propose a defense that is provably forgery-resistant \emph{independent} of the number of watermarked content collected by the attacker, provided they cannot easily distinguish watermarks from different keys. Our scheme does not further degrade model utility. We randomize the watermark key selection for each query and accept content as genuine only if a watermark is detected by \emph{exactly} one key. We focus on the image and text modalities, but our defense is modality-agnostic, since it treats the underlying watermarking method as a black-box. Our method provably bounds the attacker's success rate and we empirically observe a reduction from near-perfect success rates to only $2\%$ at negligible computational overhead.
♻ ☆ Efficient LLM Collaboration via Planning
Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large models achieve remarkable results across diverse tasks, they often incur substantial monetary inference cost, making frequent use impractical for many applications. In contrast, small models are often freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
♻ ☆ Model Merging Scaling Laws in Large Language Models ICML 2026
We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
comment: ICML 2026
♻ ☆ EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA-a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and future-informed trajectory optimization. Specifically, self-anchored visual distillation leverages self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, future-informed trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to synthesize reasoning trajectories that model future evolutions, enabling the student model to internalize the future-aware insights of the teacher. EvoDriveVLA achieves SOTA performance in nuScenes open-loop evaluation and significantly enhances performance in NAVSIM closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
comment: 19 pages, 5 figures, 5 tables
♻ ☆ AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
Many everyday robot manipulation skills are affordance-dependent, with success determined by whether the robot contacts the functional object region required by the subsequent action. Current simulation data generators obtain contacts from generic grasp estimators or per-object manual contact annotations, but generic estimators rank stable grasps without task semantics and often select contacts that are misaligned with the downstream action, while manual contact annotations must be rewritten for each new object and task. To solve these challenges, we introduce AffordSim, a scalable data generator and benchmark that integrates open-vocabulary 3D affordance prediction into simulation-based trajectory generation. Given a natural-language task description, AffordSim synthesizes a task-relevant scene, emits affordance queries, grounds them on object surfaces, samples region-conditioned grasps, and selects executable candidates with motion planning. It further randomizes object pose, texture, lighting, image noise, and cross-viewpoint backgrounds for sim-to-real transfer. We instantiate AffordSim as a 50-task benchmark across diverse manipulation skills, five robot embodiments, and 500+ rigid and articulated objects. AffordSim achieves 93% of the trajectory collection success rate of manual contact annotations on affordance-critical tasks and 89% on hard composite tasks. Vision-language-action policies trained on AffordSim data transfer zero-shot to a real Franka FR3, reaching 24% average success.
♻ ☆ Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts ICLR 2026
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the \textbf{\textit{Straggler Effect}}, as the most burdened experts dictate the overall inference latency. To address this, we first propose \textit{\textbf{Capacity-Aware Token Drop}}, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., $30\%$ speedup with only $0.9\%$ degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce \textit{\textbf{Capacity-Aware Expanded Drop}}, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a {0.2\%} average performance improvement and a {1.85$\times$} inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
comment: ICLR 2026
♻ ☆ HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data
The increasing frequency and severity of climate related disasters have intensified the need for real time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large scale remote sensing datasets. However most existing models rely on high-resolution satellite imagery with low revisit rates limiting their suitability for fast-evolving phenomena and time critical emergency response. In this work, we present HighFM, a first cut approach towards a FM for high temporal resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real time monitoring, we enhance the original architecture with fine grained temporal encodings to capture short term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.
♻ ☆ Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.
comment: Project page: https://humanoidlla.github.io/
♻ ☆ The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT $\unicode{x2013}$ Multitracer Multicenter Generalization
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
comment: Preprint submitted to Medical Image Analysis
♻ ☆ Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
♻ ☆ Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling ACL 2026
In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges -- most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and confusable hard negatives constructed via multi-model natural rollouts, rule-based perturbations, and minimal-edit LLM perturbations. We benchmark representative RMs (generative, discriminative, and LLM-as-Judge) under a unified pairwise protocol, reporting accuracy trends across varying trajectory lengths and task categories. Furthermore, we provide diagnostic analyses of prevalent failure modes. Our results reveal that all three evaluator families face substantial challenges, with performance degrading sharply on long-horizon trajectories, underscoring the necessity for specialized training in agentic, trajectory-level reward modeling. Ultimately, Plan-RewardBench aims to serve as both a practical evaluation suite and a reusable blueprint for constructing agentic planning preference data.
comment: accepted to ACL 2026 main conference
♻ ☆ AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
comment: 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released
♻ ☆ Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?
Recent advances in generative AI have shown human-level performance in complex content creation. However, we identify a "Paradox of Simplicity": models that can render complex scenes often fail at trivial, low-entropy tasks, such as generating a uniform pure color image. We argue this is a systemic failure related to uncontrollable emergent abilities. As models scale, strong priors for aesthetics and complexity override deterministic simplicity, creating an "aesthetic bias" that hinders the model's transition from data simulation to true intellectual abstraction. To better investigate this problem, we formalize the concept of AI Obedience, a hierarchical framework that grades a model's ability to transition from probabilistic approximation to pixel-level determinism (Levels 1 to 5). We introduce Violin, the first systematic benchmark designed to evaluate Level 4 Obedience through three deterministic tasks: color purity, image masking, and geometric shape generation. Using Violin, we evaluate several state-of-the-art models and reveal that closed-source models generally outperform open-source ones in deterministic precision. Interestingly, performance on our benchmark correlates with the benchmark in natural image generation. Our work provides a foundational framework and tools for achieving better alignment between human instructions and model outputs.
♻ ☆ Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
comment: 8 pages, 4 figures
♻ ☆ SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from structural mismatch, leaving them blind to harness-internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce \safeharness{}, a security architecture in which four proposed defense layers are woven directly into the agent lifecycle to address above significant limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege-separated tool control at action execution, and safe rollback with adaptive degradation at state update. The proposed cross-layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. We evaluate \safeharness{} on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, \safeharness{} achieves an average reduction of approximately 38\% in UBR and 42\% in ASR, substantially lowering both the unsafe behavior rate and the attack success rate while preserving core task utility.
comment: 26 pages, 6 figures
♻ ☆ AVA: Attentive VLM Agent for Mastering StarCraft II ACL 2026
We introduce AVACraft, a multimodal StarCraft II benchmark supporting both Multi-Agent Reinforcement Learning (MARL) and Vision-Language Model (VLM) paradigms. Unlike SMAC-family environments that rely on abstract state representations and exclude VLMs, AVACraft provides RGB visuals, natural language observations, and structured state information, enabling systematic comparison between training-based and zero-shot methods across 21 scenarios spanning micromanagement, coordination, and strategic planning. We establish comprehensive baselines: six MARL algorithms (IQL, QMIX, QTRAN, VDN, MAPPO, IPPO) with Swin-Transformer backbones trained for 5M steps, and multiple VLMs including proprietary (GPT-4o) and open-source (Qwen3-VL) models. Results reveal complementary strengths-MARL peaks at 19.3% win rate after 5M steps, while VLMs achieve 75-90% zero-shot with human-aligned decisions-exposing trade-offs between training efficiency, performance ceilings, interpretability, and deployment cost. Code: https://github.com/camel-ai/VLM-Play-StarCraft2.
comment: Accepted by ACL 2026
♻ ☆ AgentGA: Evolving Code Solutions in Agent-Seed Space
We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run in an isolated workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. Across the full benchmark, AgentGA averages 71.90% Exceeds % of Human versus 51.38% for the AIDE reference, winning 15/16 competitions. Within AgentGA runs, descendants conditioned on inherited parent archives win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals. These results support agent-seed optimization as a practical design choice for autonomous code-search systems.
comment: 30 pages total (9-page main text + references + appendix), 5 figures, 9 tables
♻ ☆ Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models ICML 2026
Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
comment: Accepted by ICML 2026
♻ ☆ SDFlow: Similarity-Driven Flow Matching for Time Series Generation
Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow ($\textbf{S}$imilarity-$\textbf{D}$riven $\textbf{Flow}$ Matching), a non-autoregressive framework that operates entirely in the frozen VQ latent space and enables parallel sequence generation via flow matching. We tackle three key challenges in making this transition: (1) eliminating exposure bias by replacing step-wise token prediction with a global transport map; (2) mitigating the high-dimensionality of VQ token spaces via a low-rank manifold decomposition with a learned anchor prior over the latent manifold; and (3) incorporating discrete supervision into continuous transport dynamics by introducing a categorical posterior over codebook indices within a variational flow-matching formulation. Extensive experiments show that SDFlow achieves state-of-the-art performance, improving Discriminative Score and substantially reducing Context-FID, particularly for challenging long-sequence generation. Moreover, SDFlow provides significant inference speedups over autoregressive baselines, offering both high fidelity and computational efficiency. Code is available at https://anonymous.4open.science/r/SDFlow-D6F3/
Computation and Language 200
☆ ELF: Embedded Language Flows
Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
comment: Tech Report. Project webpage: https://github.com/lillian039/ELF
☆ DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints will be released.
comment: 14 pages, 11 figures, 11 tables
☆ Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
comment: Implementation code is available at https://github.com/ejhshen/SLIM
☆ WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
comment: Github link: https://github.com/internlm/WildClawBench
☆ RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
comment: 63 pages, 6 figures
☆ Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
☆ Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs LREC 2026
Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy"s MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
comment: Accepted to CL4Health @ LREC 2026
☆ Compute Where it Counts: Self Optimizing Language Models ICML'26
Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
comment: Accepted at ICML'26 Code: https://github.com/akhauriyash/SOL
☆ DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.
☆ RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems ICDE 2026
This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections.
comment: Accepted by ICDE 2026 (Demonstration Track)
☆ Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding ACL 2026
Vision-Language Models (VLMs) have demonstrated remarkable progress in chart understanding, largely driven by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: charts are programmatically generated visual artifacts, where small, code-controlled visual changes can induce drastic shifts in semantics and correct answers. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of: (1) a counterfactual data synthesis pipeline via code modification, (2) a chart similarity-based data selection strategy that filters overly difficult samples for improved training efficiency, and (3) multimodal preference optimization across both textual and visual modalities. Experiments on five benchmarks show that ChartCF achieves superior or comparable performance to strong chart-specific VLMs while using significantly less training data.
comment: Accepted to ACL 2026 Main Conference
☆ Grounded Satirical Generation with RAG
Humor generation remains challenging task for Large Language Models (LLMs), due to their subjective nature. We focus on satire, a form of humor strongly shaped by context. In this work, we present a novel pipeline for grounded satire generation that uses Retrieval-Augmented Generation (RAG) over current news to produce satirical dictionary definitions in the Finnish context. We also introduce a new task-specific evaluation framework and annotate 100 generated definitions with six human annotators, enabling analysis across multiple experimental conditions, including cultural background, source-word type, and the presence or absence of RAG. Our results show that the generated definitions are perceived as more political than humorous. Both topic-based word selection and RAG improve the political relevance of the outputs, but neither yields clear gains in humor generation. In addition, our LLM-as-a-judge evaluation of five state-of-the-art models indicates that LLMs correlate well with human judgments on political relevance, but perform poorly on humor. We release our code and annotated dataset to support further research on grounded satire generation and evaluation.
☆ The Generalized Turing Test: A Foundation for Comparing Intelligence
We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
☆ Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
comment: 15 pages, 4 figures
☆ BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation ACL 2026
As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholdering. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.
comment: ACL 2026 System Demonstration paper. 2 figures
☆ Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
comment: 57 pages, 1 figure, 6 MultiTP moral dimensions
☆ Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
☆ SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor's hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.
☆ Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge ICML 2026
Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal--dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy--cost trade-offs under distribution shift.
comment: Accepted at ICML 2026
☆ The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies NeurIPS 2026
Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are "computationally important" by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with "the answer is X," removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (<=0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p<10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment <5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
comment: 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT
☆ Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses it's own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
☆ LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS Testing for Measuring Unsafe Subversion), a benchmark addressing both gaps via a semantic-physical dual verification mechanism and OS-level state rollback. LITMUS comprises 819 high-risk test cases organized into one harmful seed subset and six attack-extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping), plus a fully automated multi-agent evaluation framework judging behavior at both conversational and OS-level physical layers. Evaluation across frontier agents reveals three findings: (1) current agents lack effective safety awareness, with strong models (e.g., Claude Sonnet 4.6) still executing 40.64% of high-risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, invisible to every prior semantic-only framework; and (3) skill injection and entity wrapping attacks achieve high success rates, exposing pronounced agent vulnerabilities. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.
☆ Conformity Generates Collective Misalignment in AI Agents Societies
Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly operate as interacting populations where social influence may override individual alignment. Here we show that populations of individually aligned AI agents can be driven into stable misaligned states through conformity dynamics. Simulating opinion dynamics across nine large language models and one hundred opinion pairs, we find that each agent's behavior is governed by two competing forces: a tendency to follow the majority and an intrinsic bias toward specific positions. Using tools from statistical physics, we derive a quantitative theory that predicts when populations become trapped in long-lived misaligned configurations, and identifies predictable tipping points where small numbers of adversarial agents can irreversibly shift population-level alignment even after manipulation ceases. These results demonstrate that individual-level alignment provides no guarantee of collective safety, calling for evaluation frameworks that account for emergent behavior in AI populations.
☆ Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish ACL 2026
Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.
comment: Accepted at BigPicture Workshop 2026 (co-located with ACL 2026)
☆ Step Rejection Fine-Tuning: A Practical Distillation Recipe
Rejection Fine-Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE-bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they form a large portion of all trajectories for hard tasks and even then may be partially correct. In this work, we propose Step Rejection Fine-Tuning (SRFT) - a practical way to leverage these unresolved trajectories. For this, we employ a critic LLM to assess the correctness of each step in a trajectory. Consequently, during training, we mask the loss for erroneous steps while retaining them in the context window. This way we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE-bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding completely, reaching the total resolution rate of 32.2%.
Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
comment: 23 pages, 5 figures. This paper proposes GCAD, an attention-level activation steering method for more stable multi-turn behavior control
☆ When Can Digital Personas Reliably Approximate Human Survey Findings?
Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.
☆ A Single-Layer Model Can Do Language Modeling
Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.
comment: 9 pages, 5 figures, 1 table. Code: https://github.com/steve-z-wang/grounded-prediction-network
☆ Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm ICML 2026
Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed as continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called \textbf{S}electing \textbf{T}okens via attenti\textbf{O}n \textbf{C}ontribution~(STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.
comment: Accepted by ICML 2026
☆ Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above $40$%, while amplifying them suppresses the failure mode to less than $3$%. Leveraging the structural stability of the personality space, we show that vectors extracted $\textit{a priori}$ from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
comment: 20 pages, 9 figures including appendix
☆ Interpretable Coreference Resolution Evaluation Using Explicit Semantics ACL 2026
Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.
comment: Accepted at main conference for ACL 2026. 19 pages
☆ MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
☆ Responsible Benchmarking of Fairness for Automatic Speech Recognition
Many studies have shown automatic speech processing (ASR) systems have unequal performance across speakergroups (SG's). However, the manner in which such studies arrive at this conclusion is inconsistent. To pave the wayfor more reliable results in future studies, we lay out best practices for benchmarking ASR fairness based on literaturefrom machine learning fairness, social sciences, and speech science. We first describe the importance of preciselythe fairness hypothesis being interrogated, and tailoring fairness metrics to apply specifically to said hypothesis.We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can bemisconstrued without assiduous oversight into the intersections between SG's. We find that evaluating fairnessbased on single heterogeneous SG's, such as they are defined in fairness benchmarks, can lead to misidentifyingwhich SG's are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possibleof the intersectionality of as many demographic variables as are available in the metadata of fairness corpora in orderto tease out such spurious correlations
☆ Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
comment: To appear in the Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)
☆ Where do aspectual variants of light verb constructions belong?
Expressions with an aspectual variant of a light verb, e.g. 'take on debt' vs. 'have debt', are frequent in texts but often difficult to classify between verbal idioms, light verb constructions or compositional phrases. We investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone, assigning the expressions to one of them.
☆ LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation IJCAI
We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.
comment: Accepted at IJCAI-ECAI 2026 Demonstrations Track. Demo video: https://youtu.be/3QaKouwr4gU
☆ VISTA: A Generative Egocentric Video Framework for Daily Assistance
Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
comment: pre-print
☆ ThreatCore: A Benchmark for Explicit and Implicit Threat Detection
Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a public available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.
☆ ICT-NLP at SemEval-2026 Task 3: Less Is More -- Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression
This paper describes our system to SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
☆ Multi-domain Multi-modal Document Classification Benchmark with a Multi-level Taxonomy
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles progress toward industrially viable document intelligence. To bridge this gap, we construct the first Multi-level, Multi-domain, Multi-modal document classification Benchmark (MMM-Bench). MMM-Bench includes (1) a deeply hierarchical taxonomy spanning five levels that capture the authentic organizational logic of business documentation; and (2) 5,990 real-world multi-modal documents meticulously curated from 12 commercial domains in Alibaba. Each document is manually annotated with a complete hierarchical path by domain experts. We establish comprehensive baselines on MMM-Bench, which consists of open-weight models and API-based models. Through systematic experiments, we identify four fundamental challenges within MMM-Bench and propose corresponding insights. To provide a solid foundation for advancing research in multi-level, multi-domain document classification, we release all of the data and the evaluation toolkit of MMM-Bench at https://github.com/MMMDC-Bench/MMMDC-Bench.
☆ Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
☆ Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all the model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
☆ Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
We investigate the emergent collective dynamics of LLM-based multi-agent systems on a 2D square lattice and present a model-agnostic statistical-physics method to disentangle social conformity from intrinsic bias, compute critical exponents, and probe the collective behavior and possible phase transitions of multi-agent systems. In our framework, each node of an $L\!\times\!L$ lattice hosts an identical LLM agent holding a binary state ($+1$/$-1$, mapped to yes/no) and updating it by querying the model conditioned on the four nearest-neighbor states. The sampler temperature $T$ serves as the sole control parameter. Across three open-weight models (llama3.1:8b, phi4-mini:3.8b, mistral:7b), we measure magnetization and susceptibility under a global-flip protocol designed to probe $\mathbb{Z}_2$ symmetry. All models display temperature-driven order-disorder crossovers and susceptibility peaks; finite-size scaling on even-$L$ lattices yields effective exponents $γ/ν$ whose values are model-dependent, close to but incompatible with the 2D Ising universality class ($γ/ν=7/4$). Our method enables the extraction of effective $β$-weighted couplings $\tilde{J}(T)$ and fields $\tilde{h}(T)$, which serve as a measure of social conformity and intrinsic bias. In the models we analyzed, we found that collective alignment is dominated by an intrinsic bias ($\tilde{h}\gg\tilde{J}$) rather than by cooperative neighbor coupling, producing field-driven crossovers instead of genuine phase transitions. These effective parameters vary qualitatively across models, providing compact collective-behavior fingerprints for LLM agents and a quantitative diagnostic for the reliability of multi-agent consensus and collective alignment.
comment: 10 pages, 7 figures
☆ Infinite Mask Diffusion for Few-Step Distillation
Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.
☆ Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs reduce a residual-energy factor on the same growth pathway. Our results identify upper-layer Q/K timing as a concrete interaction point between decoder architecture and optimization.
☆ DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning
Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by \emph{incompleteness}, \emph{incorrectness}, and \emph{redundancy}, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguous or coreference resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present \textbf{DeepRefine}, a general LLM-based reasoning model for \emph{agent-compiled knowledge refinement} that improves the quality of any pre-constructed knowledge bases with user queries to make it more suitable for the downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base and conducts abductive diagnosis over interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.
☆ Coherency through formalisations of Structured Natural Language, A case study on FRETish
Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Not seldomly, formalisation tools and environments choose various levels of requirement descriptions: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in the setting where LLMs are prompted to perform reasoning tasks that can be checked by formal tools using Structured Natural Language to act as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA's Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. Some statistics are performed which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies which we present and discuss.
☆ SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still performs projection to a large vocabulary, becoming one of the major computational bottlenecks. In prior work this issue has been predominantly addressed via static or dynamic vocabulary truncation. Yet mitigating the bottleneck, these methods bring in extra complexity, such as special vocabulary curation, sophisticated inference-time logic or modifications of the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves $4\text{-}5\times$ acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to $8\text{-}9\%$ of the end-to-end speedup. Our method requires minimal adjustments of training and inference pipelines. Combined with the aforementioned speedup improvements, it makes SlimSpec a strong alternative across wide variety of draft LM-head architectures.
☆ StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 socio-demographic attributes, and comprises over 650k stories generated by 23 recent LLMs, each annotated with the socio-demographic profile of the protagonist across 19 dimensions. From these, we apply statistical tests to identify more than 1{,}500 over-represented associations, which we then rate for harmfulness through both a panel of humans (N = 247) and the same LLMs. We report three main findings. \textbf{(i)} Every model we evaluate emits consequential harmful stereotypes in open-ended generation, regardless of size or capabilities, and these associations are largely shared across providers rather than isolated misbehaviors. \textbf{(ii)} Prompt language strongly shapes which stereotypes appear: rather than transferring as a shared set of biases, harmful associations adapt culturally to the prompt language and amplify bias against locally salient protected groups. \textbf{(iii)} Human and LLM harmfulness judgments are broadly aligned (Spearman $ρ=0.62$), with disagreements concentrating on specific attribute classes rather than specific providers. To support further analyses, we release the evaluation code and the dataset, including model generations, attribute annotations, and harmfulness ratings.
comment: Preprint
☆ Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets
This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of large LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.
comment: Accepted for publication in CARMA 2026 proceedings
☆ Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis
Large language models for subjectivity analysis are typically trained with aggregated labels, which compress variations in human judgment into a single supervision signal. This paradigm overlooks the intrinsic uncertainty of low-agreement samples and often induces overconfident predictions, undermining reliability and generalization in complex subjective settings. In this work, we advocate uncertainty-aware subjectivity analysis, where models are expected to make predictions while expressing uncertainty that reflects human disagreement. To operationalize this perspective, we propose a two-phase Disagreement Perception and Uncertainty Alignment (DPUA) framework. Specifically, DPUA jointly models label prediction, rationale generation, and uncertainty expression under an uncertainty-aware setting. In the disagreement perception phase, adaptive decoupled learning enhances the model's sensitivity to disagreement-related cues while preserving task performance. In the uncertainty alignment phase, GRPO-based reward optimization further improves uncertainty-aware reasoning and aligns the model's confidence expression with the human disagreement distribution. Experiments on three subjectivity analysis tasks show that DPUA preserves task performance while better aligning model uncertainty with human disagreement, mitigating overconfidence on boundary samples, and improving out-of-distribution generalization.
☆ Phoenix-VL 1.5 Medium Technical Report
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion tokens multimodal corpus, followed by a 250-billion tokens long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22-billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles, training methodology, and highlight benchmark and inference performance.
comment: Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1
☆ Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness NeurIPS
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
comment: 9 main text pages, 36 total pages, In proceedings to 2026 NeurIPS Evaluations and Datasets Track
☆ Toward Multi-Database Query Reasoning for Text2Cypher
Large language models have significantly improved natural language interfaces to databases by translating user questions into executable queries. In particular, Text2Cypher focuses on generating Cypher queries for graph databases, enabling users to access graph data without query language expertise. Most existing Text2Cypher systems assume a single preselected graph database, where queries are generated over a known schema. However, real-world systems are often distributed across multiple independent graph databases organized by domain or system boundaries, where relevant information may span multiple sources. To address this limitation, we propose a shift from single-database query generation to multi-database query reasoning. Instead of assuming a fixed execution context, the system must reason about (i) relevant databases, (ii) how to decompose a question across them, and (iii) how to integrate partial results. We formalize this setting through a three-phase roadmap: database routing, multi-database decomposition, and heterogeneous query reasoning across database types and query languages. This work provides a structured formulation of multi-database reasoning for Text2Cypher and identifies challenges in source selection, query decomposition, and result integration, aiming to support more realistic and scalable natural language interfaces to graph databases.
☆ How Mobile World Model Guides GUI Agents?
Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.
☆ An Annotation Scheme and Classifier for Personal Facts in Dialogue
The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves $81.6 \pm 2.6$\% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92\%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset\footnotemark[1] and classifier\footnotemark[2] are publicly available.
☆ PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive \texttt{int8} quantization, PowerStep remains numerically stable and reduces optimizer memory by $\sim\!8\times$ compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
☆ ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models ICML 2026
A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches leverage Large Language Models (LLMs) to generate explanatory factors and elicit coarse-grained probability estimates. Typically, an LLM performs forward abduction to propose factors, each paired with two mutually exclusive attributes, and a Naïve Bayes model is trained over factor combinations to refine the final probabilities. However, sparse factor spaces often yield ``unknown'' outcomes, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose \textsc{Anchor}, an inference framework that orchestrates aggregated Bayesian inference over a hierarchically structured factor space. \textsc{Anchor} first constructs a dense and organized factor space via iterative generation and hierarchical clustering. It then performs context-aware mapping through hierarchical retrieval and refinement, substantially reducing ``unknown'' predictions. Finally, \textsc{Anchor} augments Naïve Bayes with a Causal Bayesian Network to capture latent dependencies among factors, relaxing the strict independence assumption. Experiments show that \textsc{Anchor} markedly reduces ``unknown'' predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.
comment: Accepted by ICML 2026
☆ Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering
Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
☆ Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
comment: Accepted to The Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
☆ DECO-MWE: building a linguistic resource of Korean multiword expressions for feature-based sentiment analysis
This paper aims to construct a linguistic resource of Korean Multiword Expressions for Feature-Based Sentiment Analysis (FBSA): DECO-MWE. Dealing with multiword expressions (MWEs) has been a critical issue in FBSA since many constructs reveal lexical idiosyncrasy. To construct linguistic resources of sentiment MWEs efficiently, we utilize the Local Grammar Graph (LGG) methodology: DECO-MWE is formalized as a Finite-State Transducer that represents lexical-syntactic restrictions on MWEs. In this study, we built a corpus of cosmetics review texts, which show particularly frequent occurrences of MWEs. Based on an empirical examination of the corpus, four types of MWEs have been distinguished. The DECO-MWE thus covers the following four categories: Standard Polarity MWEs (SMWEs), Domain-Dependent Polarity MWEs (DMWEs), Compound Named Entity MWEs (EMWEs) and Compound Feature MWEs (FMWEs). The retrieval performance of the DECO-MWE shows 0.806 f-measure in the test corpus. This study brings a twofold outcome: first, a sizeable general-purpose polarity MWE lexicon, which may be broadly used in FBSA; second, a finite-state methodology adopted in this study to treat domain-dependent MWEs such as idiosyncratic polarity expressions, named entity expressions or feature expressions, and which may be reused in describing linguistic properties of other corpus domains.
☆ MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
To tackle long-context reasoning tasks without the quadratic complexity of standard attention mechanisms, approaches based on agent memory have emerged, which typically maintain a dynamically updated memory when linearly processing document chunks. To mitigate the potential loss of latent evidence in this memorize-while-reading paradigm, recent works have integrated retrieval modules that allow agents to recall information previously discarded during memory overwriting. However, retrieval-based recall suffers from both evidence loss during memory formation and interference induced by invalid queries. To overcome these limitations, we propose MemReread. Built upon streaming reading, MemReread circumvents intermediate retrieval. It triggers question decomposition and rereading when the final memory is insufficient, enabling the recovery of indirect facts that were prematurely discarded. This design supports non-linear reasoning while preserving the inherent logical flow of document comprehension. To further enhance practicality, we introduce a reinforcement learning framework that enhances length extrapolation capability while dynamically determining the number of rereading passes based on task complexity, thereby flexibly controlling computational overhead. Extensive experiments demonstrate that MemReread consistently outperforms baseline frameworks on long-context reasoning tasks, while maintaining linear time complexity with respect to context length.
☆ Building Korean linguistic resource for NLU data generation of banking app CS dialog system
Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate a Korean annotated training data for NLU in the banking customer service (CS) domain. By an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performances of DIET-only (Intent: 0.91 /Topic [entity+feature]: 0.83), DIET+ HANBERT (I:0.94/T:0.85), DIET+ KoBERT (I:0.94/T:0.86), and DIET+ KorBERT (I:0.95/T:0.84) models trained on FIAD-generated data to extract various types of semantic items.
☆ Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection
Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
☆ Relative Score Policy Optimization for Diffusion Language Models
Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose \textbf{R}elative \textbf{S}core \textbf{P}olicy \textbf{O}ptimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.
☆ The Impact of Editorial Intervention on Detecting Native Language Traces
Native Language Identification (NLI) is the task of determining an author's native language (L1) from their non-native writings. With the advent of human-AI co-authorship, non-native texts are routinely corrected and rewritten by large language models, fundamentally altering the linguistic features NLI models depend on. In this paper, we investigate the robustness of L1 traces across increasing degrees of editorial intervention. By processing 450 essays from the Write & Improve 2024 corpus through varying levels of grammatical error correction (GEC) and paraphrasing, we demonstrate that L1 attribution does not entirely depend on surface-level errors. Instead, the detection models leverage deeper L1 features: unidiomatic lexico-semantic choices, pragmatic transfer, and the author's underlying cultural perspective. We find that minimal edits preserve these structural traces and maintain high profiling accuracy. In contrast, fluency edits and paraphrasing normalize these L1 features, leading to a severe degradation in performance.
☆ To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification
Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in in first-person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.
comment: Accepted to The First Workshop on Artificial Intelligence & Open Government at the 21st International Conference on Artificial Intelligence and Law (ICAIL), June 8, 2026, Singapore
☆ Task-Aware Calibration: Provably Optimal Decoding in LLMs
LLM decoding often relies on the model's predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model's output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model's predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
☆ How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
☆ LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
comment: Preprint. 23 pages including references and appendices
☆ V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
☆ When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews ACL 2026
Scientific peer reviews frequently contain conflicting expert judgments, and the increasing scale of conference submissions makes it challenging for Area Chairs and editors to reliably identify and interpret such disagreements. Existing approaches typically frame reviewer disagreement as binary contradiction detection over isolated sentence pairs, abstracting away the review-level context and obscuring differences in the severity of evaluative conflict. In this work, we introduce a fine-grained formulation of reviewer contradiction analysis that operates over full peer reviews by explicitly identifying contradiction evidence spans and assigning graded disagreement intensity scores. To support this task, we present RevCI, an expert-annotated benchmark of peer-review pairs with evidence-level contradiction annotations with graded intensity labels. We further propose IMPACT, a structured multi-agent framework that integrates aspect-conditioned evidence extraction, deliberative reasoning, and adjudication to model reviewer contradictions and their intensity. To support efficient deployment, we distill IMPACT into TIDE, a small language model that predicts contradiction evidence and intensity in a single forward pass. Experimental results show that IMPACT substantially outperforms strong single-agent and generic multi-agent baselines in both evidence identification and intensity agreement, while TIDE achieves competitive performance at significantly lower inference cost.
comment: accepted at ACL 2026
☆ ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.
☆ MolSight: Molecular Property Prediction with Images
Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and $2\,M$ molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and top two on $\textbf{all 10}$, at $\textbf{$\textit{80$\times$ lower}$}$ FLOPs than the nearest multi-modal competitor.
☆ NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation
Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI-powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval-Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi-agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub-agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70\% precision across test samples, with RAG retrieval precision at 74\% and overall response accuracy at 72\%, demonstrating that structured multi-agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code\footnote{https://github.com/B97784/NyayaAI} is made publicly available for the benefit of the research community.
comment: 3 pages, 1 figure
☆ Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49\% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.
☆ SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution
Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval-Augmented Execution (RAE), which often first retrieves some external skills and other knowledge, then compiles the context using retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to effectively organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors to complete tasks. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of the offline and online stages. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits, for capturing their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, instead of a mere prompt addition.
☆ GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.
comment: 19 pages, 1 figure, 2 tables
☆ FERA: Uncertainty-Aware Federated Reasoning for Large Language Models
Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.
comment: 44 pages, 8 figures
☆ PHAGE: Patent Heterogeneous Attention-Guided Graph Encoder for Representation Learning
Patent claims form a directed dependency structure in which dependent claims inherit and refine the scope of earlier claims; however, existing patent encoders linearize claims as text and discard this hierarchy. Directly encoding this structure into self-attention poses two challenges: claim dependencies mix relation types that differ in semantics and extraction reliability, and the dependency graph is defined over claims while Transformers attend over tokens. PHAGE addresses the first challenge through a deterministic graph construction pipeline that separates near-deterministic legal citations from noisier rule-based technical relations, preserving type distinctions as heterogeneous edges. It addresses the second through a connectivity mask and learnable relation-aware biases that lift claim-level topology into token-level attention, allowing the encoder to differentially weight each relation type. A dual-granularity contrastive objective then aligns representations with both inter-patent taxonomy and intra-patent topology. PHAGE outperforms all baselines on classification, retrieval, and clustering, showing that intra-document claim topology is a stronger inductive bias than inter-document structure and that this bias persists in the encoder weights after training.
☆ NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. In order to address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at https://github.com/hyundong98/NCO-Decoding.git .
☆ Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear
Futrell and Mahowald (2025) frame the success of neural language models (LMs) as supporting gradient, usage-based linguistic theories. I argue that LMs can also instantiate theories based on formal structures - the types of theories seen in the generative tradition. This argument expands the space of theories that can be tested with LMs, potentially enabling reconciliations between usage-based and generative accounts.
comment: Accepted to Behavioral and Brain Sciences; 4 pages; Commentary on "How Linguistics Learned to Stop Worrying and Love the Language Models" by Richard Futrell and Kyle Mahowald
☆ Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering
As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent \textbf{Coordination Engineering}, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose \textbf{Swarm Skills}, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification's evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.
☆ Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework ACL 2026
Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users' data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting ``positive bias'', ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.
comment: Accepted by ACL 2026 Main
☆ Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
Frontier coding agents read configuration files (CLAUDE.md, AGENTS.md, Cursor Rules) at session start and are expected to follow the conventions inside them. Practitioners assume that structural choices (file size, instruction position, file architecture, contradictions in adjacent files) measurably affect adherence. We report a systematic factorial study of these choices using four manipulated variables, measuring compliance with a trivial target annotation across 1,650 Claude Code CLI sessions (16,050 function-level observations) on two TypeScript codebases, three frontier models (primarily Sonnet 4.6, with Opus 4.6 as a CLI-matched cross-model check and Opus 4.7 reported descriptively under a CLI-version confound), and five coding tasks. We use mixed-effects models with a Bayesian companion. None of the four structural variables or three two-way interactions produces a detectable contrast after multiple-testing correction. Size and conflict nulls are supported by affirmative-null Bayes factors (BF10 between 0.05 and 0.10); position and architecture nulls are failures to reject without Bayes-factor support. The largest effect we measured is within-session: each additional function the agent generates is associated with approximately 5.6% lower odds of compliance per step (OR = 0.944) within the session-length range we tested, though the relationship is non-monotonic rather than a constant per-step effect. This reproduces on a second TypeScript codebase and on Opus 4.6 at matched configuration; it was identified during analysis rather than pre-specified. Within the conditions tested, file-structure variables did not produce detectable contrasts; compliance varies systematically between coding tasks and across each session's sequence of generated functions.
comment: 18 pages, 5 figures, 5 tables
☆ PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
☆ Speech-based Psychological Crisis Assessment using LLMs
Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
comment: 5 pages, 5 figures
Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning
In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags--some include descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
☆ Annotations Mitigate Post-Training Mode Collapse ICML 2026
Post-training (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pretraining distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pretraining. Our approach is simple: we pretrain on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pretraining data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pretraining's semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain $6 \times$ less diversity collapse than models trained with SFT, and improve with scale.
comment: 21 pages, 8 figures, 11 tables. Accepted at ICML 2026
☆ Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system's integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.
comment: Preprint. Implementation and open-source community version available at: https://github.com/corbenicai/merlin-community - https://doi.org/10.5281/zenodo.20090991
☆ Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage
Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $Δ_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
☆ GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction
Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.
comment: Under submission
☆ The Truth Lies Somewhere in the Middle (of the Generated Tokens)
How should hidden states generated autoregressively be collapsed into a representation that reflects a language model's internal state? Despite tokens being generated under causal masking, we find that mean pooling across their hidden states yields more semantic representations than any individual token alone. We quantify this through kernel alignment to reference spaces in language, vision, and protein domains. The improvement through mean pooling is consistent with information being distributed across generated tokens rather than localized to a single position. Furthermore, representations derived from generated tokens outperform those from prompt tokens, and alignment across generation reveals interpretable dynamics in model behavior.
☆ G-Zero: Self-Play for Open-Ended Generation from Zero Data
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$δ$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filteration keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
☆ Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks
Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across various NLP tasks. We propose an agreement-based clustering technique to model the disagreement between the annotators. We conduct comprehensive experiments in 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better for modeling clustered annotators than an ensemble and model majority vote.
comment: Pre-MIT Press publication version
☆ TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
Multimodal large language models increasingly solve vision-centric tasks by calling external tools for visual inspection, OCR, retrieval, calculation, and multi-step reasoning. Current tool-using agents usually expose the executed tool trajectory and the final answer, but they rarely specify which tool observation supports each generated claim. We call this missing claim-level dependency structure the provenance gap. The gap makes tool use hard to verify and hard to optimize, because useful evidence, redundant exploration, and unsupported reasoning are mixed in the same trajectory. We introduce TRACER, a framework for verifiable generative provenance in multimodal tool-using agents. Instead of adding citations after generation, TRACER generates each answer sentence together with a structured provenance record that identifies the supporting tool turn, evidence unit, and semantic support relation. Its relation space contains Quotation, Compression, and Inference, covering direct reuse, faithful condensation, and grounded derivation. TRACER verifies each record through schema checking, tool-turn alignment, source authenticity, and relation rationality, and then converts verified provenance into traceability constraints and provenance-derived local credit for reinforcement learning. We further construct TRACE-Bench, a benchmark for sentence-level provenance reconstruction from coarse multimodal tool trajectories. On TRACE-Bench, simply adding tools often introduces noise. With Qwen3-VL-8B, TRACER reaches 78.23% answer accuracy and 95.72% summary accuracy, outperforming the strongest closed-source tool-augmented baseline by 23.80 percentage points. Compared with tool-only supervised fine-tuning, it also reduces total test-set tool calls from 4949 to 3486. These results show that reliable multimodal tool reasoning depends on provenance-aware use of observations, not on more tool calls alone.
☆ FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner-outer behavior. On BABILong, FocuSFT improves accuracy by up to +14pp across 4K--32K context lengths; on RULER, it raises CWE aggregation from 72.9\% to 81.1\% at 16K; and on GPQA with agentic tool use, it yields a 24\% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529$\times$ and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
☆ PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and can help LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved successfully within a few subsequent turns. If not, LLMs often struggle to resolve such errors even with many additional turns. Building on the above observations, we propose PruneTIR, a rather effective yet efficient framework that enhances the tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. These three components enable PruneTIR to mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.
☆ Evolving Knowledge Distillation for Lightweight Neural Machine Translation
Recent advancements in Neural Machine Translation (NMT) have significantly improved translation quality. However, the increasing size and complexity of state-of-the-art models present significant challenges for deployment on resource-limited devices. Knowledge distillation (KD) is a promising approach for compressing models, but its effectiveness diminishes when there is a large capacity gap between teacher and student models. To address this issue, we propose Evolving Knowledge Distillation (EKD), a progressive training framework in which the student model learns from a sequence of teachers with gradually increasing capacities. Experiments on IWSLT-14, WMT-17, and WMT-23 benchmarks show that EKD leads to consistent improvements at each stage. On IWSLT-14, the final student achieves a BLEU score of 34.24, narrowing the gap to the strongest teacher (34.32 BLEU) to just 0.08 BLEU. Similar trends are observed on other datasets. These results demonstrate that EKD effectively bridges the capacity gap, enabling compact models to achieve performance close to that of much larger teacher models.Code and models are available at https://github.com/agi-content-generation/EKD.
☆ Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs ACL 2026
While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization due to a diminishing gap between positive and negative responses over successive training iterations. In this paper, we propose Team-based self-Play with dual Adaptive Weighting (TPAW), a novel self-play algorithm designed to improve alignment in a fully self-supervised setting. TPAW adopts a team-based framework in which the current policy model both collaborates with and competes against historical checkpoints, promoting more stable and efficient optimization. To further enhance learning, we design two adaptive weighting mechanisms: (i) a response reweighting scheme that adjusts the importance of target responses, and (ii) a player weighting strategy that dynamically modulates each team member's contribution during training. Initialized from a SFT model, TPAW iteratively refines alignment without requiring additional human supervision. Experimental results demonstrate that TPAW consistently outperforms existing baselines across various base models and LLM benchmarks. Our code is publicly available at https://github.com/lab-klc/TPAW.
comment: Accepted by ACL 2026 Main
☆ Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents ICML'26
The implicit policy of maintaining relatively stable acceptance rates at top AI conferences, despite exponentially growing submissions, introduces a critical structural vulnerability. This position paper characterizes a new systemic threat we term Agentic Denominator Gaming, in which a malicious actor deploys AI agents to generate and submit a large volume of superficially plausible but low-quality papers. Crucially, their objective is not the acceptance of low-quality papers, but rather to inflate the submission denominator and overwhelm reviewing capacity. Under a relatively stable acceptance rate, this dilution can systematically increase the publication probability of a small, targeted set of legitimate papers. We analyze the practical feasibility of this threat and its broader consequences, including intensified reviewer burnout, degraded review quality, and the emergence of industrialized automated agent mills. Finally, we propose and evaluate a range of mitigation strategies, and argue that durable protection will require system-level policy and incentive reforms, rather than relying primarily on technical detection alone.
comment: Accepted by ICML'26 Position Track
☆ The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families, equivalence judgment, move prediction, identification, and cross-modal grounding; an image-versus-symbol split locates failures along the perception-operation gap. We score Claude Opus 4.7 and GPT-5, each with and without thinking, under a 64K output-token budget matched on both vendors. Across 56 (task, model) cases, 15 sit at or below a random baseline and 8 of 14 tasks have a best score under 1.5x random. On diagram-to-symbol transcription, no model produces a strictly correct string, and permissive Regina decoding recovers the knot in 0 to 4 of 100 items. Thinking-mode reasoning lifts overall accuracy by 1.65 points for Claude and 9.25 points for GPT-5, narrowing the gap only modestly. Read together, the four families suggest current vision-language models hold features of a diagram but lack apparatus to simulate moves on those features.
comment: 41 pages, 18 figures
☆ Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure mode we call "Pseudo-Deliberation": the appearance of principled reasoning without corresponding behavioral alignment. To study this systematically, we introduce VALDI, a framework for measuring alignment between stated values and generated dialogue. VALDI includes 4,941 human-centered scenarios across five domains, three tasks that elicit value articulation, reasoning, and action, and five metrics for quantifying value adherence. Across both proprietary and open-source LLMs, we observe consistent misalignment between expressed values and downstream dialogues. To investigate intervention strategies, we propose VIVALDI, a multi-agent value auditor that intervenes at different stages of generation.
comment: 9 pages
☆ Key-Value Means
We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong $O(N)$ chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM-paper and trained models at https://huggingface.co/collections/recursal/key-value-means under the Apache 2.0 license.
☆ EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
Next-generation visual assistants, such as smart glasses, embodied agents, and always-on life-logging systems, must reason over an entire day or more of continuous visual experience. In ultra-long video settings, relevant information is sparsely distributed across hours or days, making memory a fundamental challenge: models must accumulate information over time, recall prior states, track temporal order, and abstract recurring patterns. However, existing week-long video benchmarks are primarily designed for perception and recognition, such as moment localization or global summarization, rather than reasoning that requires integrating evidence across multiple days. To address this gap, we introduce EgoMemReason, a comprehensive benchmark that systematically evaluates week-long egocentric video understanding through memory-driven reasoning. EgoMemReason evaluates three complementary memory types: entity memory, tracking how object states evolve and change across days; event memory, recalling and ordering activities separated by hours or days; and behavior memory, abstracting recurring patterns from sparse, repeated observations over the whole week period. EgoMemReason comprises 500 questions across three memory types and six core challenges, with an average of 5.1 video segments of evidence per question and 25.9 hours of memory backtracking. We evaluate EgoMemReason on 17 methods across MLLMs and agentic frameworks, revealing that even the best model achieves only 39.6% overall accuracy. Further analysis shows that the three memory types fail for distinct reasons and that performance degrades as evidence spans longer temporal horizons, revealing that long-horizon memory remains far from solved. We believe EgoMemReason establishes a strong foundation for evaluating and advancing long-context, memory-aware multimodal systems.
comment: The first two authors contributed equally. Project website: https://egomemreason.github.io/
☆ Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
Production LLM coding agents drift over long sessions: they forget user-specified constraints, slip into mistakes the user already flagged, and confabulate prior agreements. White-box approaches such as persona vectors require model weights and so cannot be applied to closed APIs (Claude, GPT-4) that most users actually interact with. We present Nautilus Compass, a black-box persona drift detector and agent memory layer for production coding agents. The method operates entirely at the prompt-text layer: cosine similarity between user prompts and behavioral anchor texts, aggregated by a weighted top-k mean using BGE-m3 embeddings. Compass is, to our knowledge, the only public agent memory layer (among Mem0, Letta, Cognee, Zep, MemOS, smrti verified May 2026) that does not call an LLM at index time to extract facts or build a graph; raw conversation text is embedded directly. The system ships as a Claude Code plugin, an MCP 2024-11-05 A2A server (Cursor, Cline, Hermes), a CLI, and a REST API on one daemon, with a Merkle-chained audit log for tamper-evident anchor updates. On a held-out test set built from real Claude Code session traces and labeled by an independent LLM judge, Compass reaches ROC AUC 0.83 for drift detection. The embedded retrieval pipeline scores 56.6% on LongMemEval-S v0.8 and 44.4% on EverMemBench-Dynamic (n=500), topping the four published EverMemBench Table 4 baselines. LongMemEval-S 56.6% is ~30 points below recent white-box leaders (90+%); we treat that as the architectural ceiling of the no-extraction design. End-to-end reproduction cost is $3.50 (~14x cheaper than GPT-4o-judged stacks). A paired cross-vendor behavior A/B accompanies these numbers as preliminary system-level evidence. Code, anchors, frozen test data, and audit-log tooling are MIT-licensed at github.com/chunxiaoxx/nautilus-compass.
comment: 19 pages, 6 figures. MIT-licensed code + reproduction scripts at github.com/chunxiaoxx/nautilus-compass
☆ The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs
The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
comment: 27 pages, 13 tables. Code, data, prompts, and rubrics released with the paper. OSF deposit pending; DOI in v2
☆ The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care
Sentiment analysis has been of long-standing interest in psychotherapy research. Recently, the Transformer deep learning architecture has produced text-based sentiment analysis models that are highly accurate and context-aware. These models have been explored as proxies for emotion measurement instruments in psychotherapy, but not investigated as stand-alone psychometric tools. Using proposed utterance-level and session-level sentiment features derived from a fine-grained sentiment model on a large corpus of psychotherapy sessions (N = 751), we investigate the distribution of session aggregated sentiment scores. Further, we characterize the relationship of these features to individual components and the overall score of the OQ-45 instrument and find that this sentiment feature is most strongly correlated to components related to emotional valence in directionally intuitive ways. Finally, we report that there are statistically significant differences between the sentiment distributions for patients flagged as at risk of deterioration or dropping out of care via either the OQ Rational or Empirical outcome models. These correlations to a fully-validated psychometric instrument demonstrate that these proposed sentiment features are, at least, adjunctive measures of client distress and deterioration.
comment: 20 pages, 4 figures
☆ Instruction Adherence in Coding Agent Configuration Files: A Factorial Study of Four File-Structure Variables
Frontier coding agents read configuration files (CLAUDE$.$md, AGENTS$.$md, Cursor Rules) at session start and are expected to follow the conventions inside them. Practitioners assume that structural choices (file size, instruction position, file architecture, contradictions in adjacent files) measurably affect adherence. We report a systematic factorial study of these choices using four manipulated variables, measuring compliance with a trivial target annotation across 1,650 Claude Code CLI sessions (16,050 function-level observations) on two TypeScript codebases, three frontier models (primarily Sonnet 4.6, with Opus 4.6 as a CLI-matched cross-model check and Opus 4.7 reported descriptively under a CLI-version confound), and five coding tasks. We use mixed-effects models with a Bayesian companion. None of the four structural variables or three two-way interactions produces a detectable contrast after multiple-testing correction. Size and conflict nulls are supported by affirmative-null Bayes factors (BF10 between 0.05 and 0.10); position and architecture nulls are failures to reject without Bayes-factor support. The largest effect we measured is within-session: each additional function the agent generates is associated with approximately 5.6% lower odds of compliance per step (OR = 0.944) within the session-length range we tested, though the relationship is non-monotonic rather than a constant per-step effect. This reproduces on a second TypeScript codebase and on Opus 4.6 at matched configuration; it was identified during analysis rather than pre-specified. Within the conditions tested, file-structure variables did not produce detectable contrasts; compliance varies systematically between coding tasks and across each session's sequence of generated functions.
comment: 18 pages, 5 figures, 5 tables
☆ Much of Geospatial Web Search Is Beyond Traditional GIS
Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries - what people ask of place, and how often - remains poorly characterised at scale. We apply dense sentence embeddings, a lightweight SetFit classifier, and density-based clustering to the full MS MARCO corpus of 1.01 million real Bing queries without prior filtering for toponyms or spatial keywords, identifying 181,827 geospatial queries (18.0%), nearly threefold the 6.17% labelled as Location in the original annotations. The resulting taxonomy of 88 query categories reveals that geospatial web search is dominated by transactional and practical lookups: costs and prices alone account for 15.3% of geospatial queries, nearly twice the size of the entire physical geography theme. Much of this activity - costs, opening hours, contact details, weather, travel recommendations - falls outside the scope traditional GIS systems and knowledge graphs are built to serve. The categories vary substantially in the kind of answer they admit, from deterministic lookups answerable from spatial databases or knowledge graphs to evaluative or temporally volatile queries that require generative or real-time systems. We discuss implications for hybrid retrieval architectures and for benchmarks of geographic reasoning in large language models. We openly release the labelled dataset, classifier, and taxonomy.
☆ VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
comment: 16 pages, 6 figures
☆ SOMA: Efficient Multi-turn LLM Serving via Small Language Model
Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.
☆ Predicting Psychological Well-Being from Spontaneous Speech using LLMs
We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80\% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.
☆ A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse
We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As in [arXiv:2504.14370, arXiv:2511.05295], we aim for \emph{breadth}, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
☆ LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
☆ Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
☆ ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.
☆ Unlocking LLM Creativity in Science through Analogical Reasoning
Autonomous science promises to augment scientific discovery, particularly in complex fields like biomedicine. However, this requires AI systems that can consistently generate novel and diverse solutions to open-ended problems. We evaluate LLMs on the task of open-ended solution generation and quantify their tendency to mode collapse into low-diversity generations. To mitigate this mode collapse, we introduce analogical reasoning (AR) as a new approach to solution generation. AR generates analogies to cross-domain problems based on shared relational structure, then uses those analogies to search for novel solutions. Compared to baselines, AR discovers significantly more diverse generations (improving solution diversity metrics by 90-173%), generates novel solutions over 50% of the time (compared to as little as 1.6% for baselines), and produces high-quality analogies. To validate the real-world feasibility of AR, we implement AR-generated solutions across four biomedical problems, yielding consistent quantitative gains. AR-generated approaches achieve a nearly 13-fold improvement on distributional metrics for perturbation effect prediction, outperform all baselines on AUPRC when predicting cell-cell communication, infer brain region interactions with a high Spearman correlation ($ρ$=0.729) to published methods, and establish state-of-the-art performance on 2 datasets for oligonucleotide property prediction. The novel and diverse solutions produced by AR can be used to augment the search space of existing solution generation methods.
☆ HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew--English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8\%, outperforming DictaLM-3.0-24B-Thinking (68.9\%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3 architecture for any target language, and the first open-weight Hebrew-specialized MoE model with native long-context support. Model weights are released openly to support further research in Hebrew and Semitic-language NLP.
☆ RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German ACL 2026
In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
comment: To be presented at the BEA 2026 workshop, co-located with ACL 2026
☆ ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Computer-use agents~(CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in no or very limited improvement in the performance when using history unlike other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed. This suggests that the commonly observed saturation in visual history is not due to limited usefulness of past information, but rather a consequence of inefficient token representations.
☆ Instructions shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
☆ How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
comment: 14 pages, 1 figure
☆ The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models
Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate ($\sim$1\% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36\% to 96\%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves $1.7\times$ the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.
comment: 9 pages main text, 5 figures, 24 pages appendix
☆ Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary
We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V approx 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation -- and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Delta = log PPL_ref - log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear alpha=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of approx -0.028 nats (t=-4.46,p=0.047 in the primary chain). An auxiliary alpha=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.
☆ ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
comment: 46 pages including appendices (two-column preprint format). Under review at JAMIA. Code, frozen evaluator, and benchmark released at https://huggingface.co/datasets/alexstinard/epikg-clinicalbench. ClinicalBench v2 is a 400-question MIMIC-IV stress test for assertion-aware retrieval
☆ Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs
Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity--diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated only on few valid continuations while having a heavy-tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
☆ On Problems of Implicit Context Compression for Software Engineering Agents
LLM-based Software Engineering agents face a critical bottleneck: context length limitations cause failures on complex, long-horizon tasks. One promising solution is to encode context as continuous embeddings rather than discrete tokens, enabling denser information storage. We apply the recently proposed In-Context Autoencoder for this purpose. While the method performs well on single-shot common-knowledge and code-understanding tasks, our experiments demonstrate that it fails on multi-step agentic coding tasks. In this paper, we explore this phenomenon and discuss possible factors contributing to this failure.
♻ ☆ GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models
Diffusion models have recently shown strong potential in language modeling, offering faster generation compared to traditional autoregressive approaches. However, applying supervised fine-tuning (SFT) to diffusion models remains challenging, as they lack precise probability estimates at each denoising step. While the diffusion mechanism enables the model to reason over entire sequences, it also makes the generation process less predictable and often inconsistent. This highlights the importance of controlling key tokens that guide the direction of generation. To address this issue, we propose GIFT, an importance-aware finetuning method for diffusion language models, where tokens are assigned different importance weights based on their entropy. Derived from diffusion theory, GIFT delivers substantial gains: across diverse settings including different mainstream training datasets ranging from 1k to 10k in size, utilizing LoRA or full parameter fine-tuning, and training on base or instruct models, GIFT consistently achieves superior overall performance compared to standard SFT on four widely used reasoning benchmarks (Sudoku, Countdown, GSM8K, and MATH-500).
comment: preprint
♻ ☆ Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.
comment: 20 pages
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 3 popular agent harnesses and 5 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 45.1%.
comment: 29 pages, 16 figures
♻ ☆ Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
comment: Work in progress (48 pages)
♻ ☆ Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments
Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches to some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.
♻ ☆ Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Furthermore, integrating LED into reinforcement learning, e.g., using GRPO as the rollout strategy, yields faster reward improvement and higher final performance, due to the efficient exploration capability of LED. Project page: https://github.com/AlbertTan404/LED.
comment: Project Page: https://github.com/AlbertTan404/LED
♻ ☆ Beyond Multiple Choice: Evaluating Steering Vectors for Summarization EACL 2026
Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.
comment: Published in Findings of EACL 2026. Extended version of the ICML 2025 Workshop on Reliable and Responsible Foundation Models paper (v1, v2). 36 pages, 21 figures, 15 tables
♻ ☆ VeRO: An Evaluation Harness for Agents to Optimize Agents ICML
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
comment: Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026
♻ ☆ MUR: Momentum Uncertainty guided Reasoning
Current models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide current model' test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by by over 45% on average while improving accuracy from 0.33 to 3.46%.
♻ ☆ AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.
♻ ☆ The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
comment: Submitted to the 2026 Annual Meeting of the Cognitive Science Society
♻ ☆ The Realignment Problem: When Right becomes Wrong in LLMs ICML 2026
Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment-Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms realignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a proxy judge, TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization to prioritize high-leverage samples; and (3) executing updates using a hybrid objective that combines relational losses (e.g., IPO) for preference inversion and punitive losses (e.g., NPO) for response suppression. Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B demonstrate robust realignment on synthetic benchmarks and the PKU-SafeRLHF dataset without degrading general utility. This work provides a scalable approach for LLM realignment under evolving data annotation policies and alignment guidelines. We release our code: https://respailab.github.io/TRACE/
comment: ICML 2026
♻ ☆ Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of $26\%$ against compaction, $130\%$ against CodeAct with sub-calls, and $13\%$ against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
comment: 9 pages, 43 with Appendix
♻ ☆ CktFormalizer: Autoformalization of Natural Language into Circuit Representations
LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker:dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall:compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant:the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95--100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proof ensuring that each optimized variant remains functionally equivalent to its formal specification.
♻ ☆ Selective Neuron Amplification in Transformer Language Models
Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification, which increases the influence of task relevant neurons without changing the model's parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
comment: 11 pages, 3 figures. Preprint. Code and experiments conducted independently
♻ ☆ What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering ICLR 2026
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
comment: 41 pages, 34 figures, Accepted at ICLR 2026, Code available at https://github.com/Jim-Maar/implicit-planning-in-llms
♻ ☆ AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which would easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52\% and achieves up to 1.34$\times$ improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while achieving convergence loss on pretraining and comparable accuracy on downstream tasks with LLaMA architectures.
♻ ☆ Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models
Theories of democratic stability, populism, and party-system crisis often point to a form of polarization that comparative research rarely measures directly: hostile relations among political elites. Existing comparative measures capture adjacent phenomena, including mass affective polarization, or elite ideological distance, but not directed mutual elite evaluation. This paper introduces the Elite Polarization Score, a measurement of out-party evaluations in parliamentary speech. Large Language Models identify political actors mentioned in parliamentary debates, recover speaker-target pairs, estimate the sentiment directed at each actor, standardize heterogeneous references into party dyads, and aggregate these evaluations into party- and parliament-level measures of mutual out-party negativity. The validity of the approach is demonstrated on parliamentary corpora from the United Kingdom, Hungary, and Italy, covering up to four decades of debate. The resulting measure is conceptually distinct from mass affective polarization, elite ideological polarization, incivility, negative campaigning, and general sentiment. Evidence from the UK case study shows that it is also empirically distinct from mass affective polarization, elite ideological polarization, and incivility. Extreme negative evaluations can also be used to locate pernicious polarization rhetoric. Validation across three countries finds no false discoveries, sentiment estimates accurate to roughly 10 percent of the scale range, and AI sensitivity that meets or exceeds that of human coders in two of three settings. Because the algorithm is multilingual, requires no task-specific training, and can be aggregated by party and quarter, it provides a scalable basis for future cross-national research on what produces elite polarization and what elite polarization itself produces
♻ ☆ MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks ICML 2026
While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat
comment: Accepted to ICML 2026
♻ ☆ CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis
Depression is a pressing global public health issue, yet publicly available Chinese-language resources for depression risk detection remain scarce and largely focus on binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection on Chinese social media. The dataset contains 44,178 posts from 233 users; psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels along with structured, multidimensional psychological attributes, enabling interpretable and fine-grained analyses of depressive signals. Experimental results demonstrate the dataset's utility across a range of NLP tasks, including structured psychological profiling and fine-tuning large language models for depression detection. Comprehensive evaluations highlight the dataset's effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights for mental health applications tailored to Chinese-speaking populations.
♻ ☆ Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the \textit{inference-time scaling wall}. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce \textbf{Elastic Mixture-of-Experts (EMoE)}, a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B--21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3$\times$ the training-time $k$, while also achieving higher peak performance.
♻ ☆ Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
We introduce Holmes, a new benchmark designed to assess language models (LMs) linguistic competence - their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.
♻ ☆ CREATE: Testing LLMs for Associative Creativity
A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
♻ ☆ GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of ``jailbreaks'' to diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.
comment: 56 pages
♻ ☆ Inductive Entity Representations from Text via Link Prediction
Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to preform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model, that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited KG-specific tasks, and have greater generalization properties than evaluated in previous work.
comment: The Web Conference 2021
♻ ☆ Why is prompting hard? Understanding prompts on binary sequence predictors
Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. On numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.
♻ ☆ CARL: Criticality-Aware Agentic Reinforcement Learning
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each step holds equal contribution, which deviates significantly from reality. Our analysis reveals that only the action choices on a small fraction of states are critical in determining the final outcome. Building on this insight, we propose CARL, a criticality-aware reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for state criticality and achieves focused training by assigning rewards to actions taken from high-criticality states while excluding actions taken from low-criticality states from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
comment: 18 pages, 6 figures
♻ ☆ How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients ACL2026
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
comment: ACL2026, camera-ready
♻ ☆ BaseCal: Unsupervised Confidence Calibration via Base Model Signals ACL 2026
Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90\% compared to the best unsupervised baselines.
comment: ACL 2026 Main
♻ ☆ Schoenfeld's Anatomy of Mathematical Reasoning by Language Models ACL2026
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
comment: ACL2026, camera-ready
♻ ☆ Composing Policy Gradients and Prompt Optimization for Language Model Programs
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM - with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai .
comment: ACM CAIS 2026. Lakshya*, Dilara*, and Noah* contributed equally to this work
♻ ☆ MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier ICML 2026
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework that enables tractable and scalable training of $P(h|b)$, while supporting more scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Empirically, MOOSE-Star scales continuously with training data and inference budget, whereas direct brute-force sampling hits a complexity wall.
comment: Accepted by ICML 2026
♻ ☆ Deterministic Differentiable Structured Pruning for Large Language Models ICML26
Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train--test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train--test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
comment: Published at ICML26;
♻ ☆ OpenClaw-RL: Train Any Agent Simply by Talking
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
comment: Code: https://github.com/Gen-Verse/OpenClaw-RL
♻ ☆ ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
This paper introduces ChatbotManip, a novel dataset for studying manipulation in Chatbots. It contains simulated generated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84\% of such conversations. Second, even when only instructed to be ``persuasive'' without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, small fine-tuned open source models, such as BERT+BiLSTM have a performance comparable to zero-shot classification with larger models like Gemini 2.5 pro in detecting manipulation, but are not yet reliable for real-world oversight. Our work provides important insights for AI safety research and highlights the need of addressing manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
♻ ☆ Incremental Multilingual Text2Cypher with Adapter Combination
Large Language Models enable users to access database using natural language interfaces using tools like Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combined them via uniform linear merging or learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75\% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental language expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for multilingual Text2Cypher task.
♻ ☆ SLR: Automated Synthesis for Scalable Logical Reasoning
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
♻ ☆ Chinese Cyberbullying Detection: Dataset, Method, and Validation
Existing cyberbullying detection benchmarks were organized by the polarity of speech, such as "offensive" and "non-offensive", which were essentially hate speech detection. However, in the real world, cyberbullying often attracted widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, which consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanations generation as an ensemble method to generate the pseudo labels, and then let human annotators judge these labels. Then we propose the evaluation criteria for validating whether it constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can be a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study for the Chinese cyberbullying incident detection task.
♻ ☆ Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction
While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause as \emph{Contextual Inertia}: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce \textbf{R}einforcement \textbf{L}earning with \textbf{S}ingle-\textbf{T}urn \textbf{A}nchors (\textbf{RLSTA}), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications. Code is available at https://github.com/Tencent/RLSTA.
♻ ☆ Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
♻ ☆ A Scalable Entity-Based Framework for Auditing Bias in LLMs
Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying either on artificial prompts that poorly reflect real-world use or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework that uses named entities as controlled probes to measure systematic disparities in model behavior. Synthetic data enables us to construct diverse, controlled inputs, and we show that it reliably reproduces bias patterns observed in natural text, supporting its use for large-scale analysis. Using this framework, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. We find consistent patterns: models penalize right-wing politicians and favor left-wing politicians, prefer Western and wealthier countries over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not mitigate Western-aligned preferences. These findings highlight the need for systematic bias auditing before deploying LLMs in high-stakes applications. Our framework is extensible to other domains and tasks, and we make it publicly available to support future work.
♻ ☆ EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
♻ ☆ Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts ICLR 2026
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the \textbf{\textit{Straggler Effect}}, as the most burdened experts dictate the overall inference latency. To address this, we first propose \textit{\textbf{Capacity-Aware Token Drop}}, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., $30\%$ speedup with only $0.9\%$ degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce \textit{\textbf{Capacity-Aware Expanded Drop}}, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a {0.2\%} average performance improvement and a {1.85$\times$} inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
comment: ICLR 2026
♻ ☆ ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room
Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department workflow: triage intake, treatment selection, disposition planning, and final diagnosis. Crucially, evaluation in ER-Reason extends beyond diagnostic accuracy to include stepwise Script Concordance Test (SCT)-style questions grounded in real patient cases, which assess whether LLMs update their diagnostic beliefs in the correct direction and magnitude as clinical evidence accumulates, scored against 2,555 emergency physician annotations. We evaluate reasoning and non-reasoning LLMs on ER-Reason, and show that our tasks provide a more nuanced view of how LLM reasoning fails on real patient cases than existing benchmarks allow.
♻ ☆ Sparse Reward Subsystem in Large Language Models
Recent studies show that LLM hidden states encode reward-related information, such as answer correctness and model confidence. However, existing approaches typically fit black-box probes on the full hidden states, offering little insight into how this information is structured across neurons. In this paper, we show that reward-related information is concentrated in a sparse subset of neurons. Using simple probing, we identify two types of neurons: value neurons, whose activations predict state value, and dopamine neurons, whose activations encode step-level temporal difference (TD) errors. Together, these neurons form a sparse reward subsystem within LLM hidden states. These names are drawn by analogy with neuroscience, where value neurons and dopamine neurons in the biological reward subsystem also encode value and reward prediction errors, respectively. We demonstrate that value neurons are robust and transferable across diverse datasets and models, and provide causal evidence that they encode reward-related information. Finally, we show applications of the reward subsystem: value neurons serve as effective predictors of model confidence, and dopamine neurons can function as a process reward model (PRM) to guide inference-time search.
♻ ☆ Interactive Benchmarks
Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
comment: Project Page: https://github.com/interactivebench/interactivebench
♻ ☆ Functional Subspace, where language models can use vector algebra to solve problems
Large language models (LLMs) were invented for natural language tasks such as translation, but they have proved that they can perform highly complex functions across domains. Additionally, they have been thought to develop new skills without being trained on them. These learning capabilities lead to LLMs adoption in a wide range of domains. Thus, it is imperative that we understand their operating mechanisms and limitations for proper diagnostics and repair. The earlier studies proposed that high level concepts are encoded as linear directions in LLMs activation space and that the geometry of embeddings have semantic meanings. Inspired by these studies, we hypothesize that LLMs may use subspaces and vector algebra in subspaces to perform tasks. To address this hypothesis, we analyze LLMs' functional modules and residual streams collected from LLMs engaging in in-context learning (ICL), one of the emergent abilities. Our analyses suggest that 1) LLMs can create subspaces, where evidence can be accumulated and 2) ICL tasks can be solved via simple algebraic operations in subspaces.
comment: page 20, 7 main figures, 8 supplementary figures
♻ ☆ Talk to Your Slides: High-Efficiency Slide Editing via Language-Driven Structured Data Manipulation ACL2026
Editing presentation slides is a frequent yet tedious task, ranging from creative layout design to repetitive text maintenance. While recent GUI-based agents powered by Multimodal LLMs (MLLMs) excel at tasks requiring visual perception, such as spatial layout adjustments, they often incur high computational costs and latency when handling structured, text-centric, or batch processing tasks. In this paper, we propose Talk-to-Your-Slides, a high-efficiency slide editing agent that operates via language-driven structured data manipulation rather than relying on the image modality. By leveraging the underlying object model instead of screen pixels, our approach ensures precise content modification while preserving style fidelity, addressing the limitations of OCR-based visual agents. Our system features a hierarchical architecture that effectively bridges high-level user instructions with low-level execution codes. Experiments demonstrate that for text-centric and formatting tasks, our method enables 34% faster processing, achieves 34% better instruction fidelity, and operates at an 87% lower cost compared to GUI-based baselines. Furthermore, we introduce TSBench, a human-verified benchmark dataset comprising 379 instructions, including a Hard subset designed to evaluate robustness against complex and visually dependent queries. Our code and benchmark are available at https://github.com/KyuDan1/Talk-to-Your-Slides.
comment: 30 pages, Accepted at ACL2026
♻ ☆ SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness ACL 2026
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://github.com/MM-Speech/SDiaReward/.
comment: Accepted to ACL 2026 Main Conference
♻ ☆ AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning EMNLP 2024
Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning and complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.
comment: EMNLP 2024 Main Conference
♻ ☆ Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning ACL-26
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling by securing a robust SFT foundation with substantial RL plasticity; (2) Refuting the ``Less is More'' hypothesis in SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) The Minimum Validation Loss of SFT serves as a reliable indicator for selecting the expert trajectories that maximize the ultimate performance ceiling. Our findings provide actionable guidelines for extracting maximum value from expert trajectories.
comment: ACL-26, Main Conference
♻ ☆ Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
comment: Under Review; Project Page: https://elijah0430.github.io/poise/
♻ ☆ Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios
Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
♻ ☆ LLM as Graph Kernel: Rethinking Message Passing on Text-Rich Graphs
Text-rich graphs, which integrate complex structural dependencies with abundant textual information, are ubiquitous yet remain challenging for existing learning paradigms. Conventional methods and even LLM-hybrids compress rich text into static embeddings or summaries before structural reasoning, creating an information bottleneck and detaching updates from the raw content. We argue that in text-rich graphs, the text is not merely a node attribute but the primary medium through which structural relationships are manifested. We introduce RAMP, a Raw-text Anchored Message Passing approach that moves beyond using LLMs as mere feature extractors and instead recasts the LLM itself as a graph-native aggregation operator. RAMP exploits the text-rich nature of the graph via a novel dual-representation scheme: it anchors inference on each node's raw text during each iteration while propagating dynamically optimized messages from neighbors. It further handles both discriminative and generative tasks under a single unified generative formulation. Extensive experiments show that RAMP effectively bridges the gap between graph propagation and deep text reasoning, achieving competitive performance and offering new insights into the role of LLMs as graph kernels for general-purpose graph learning.
comment: 23 pages, 5 figures
♻ ☆ Teaching Language Models to Think in Code
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
comment: Preprint
♻ ☆ Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression ICML 2026
Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.
comment: Accepted by ICML 2026
♻ ☆ Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs
In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this contradiction, the ensuing cognitive dissonance is expected to lead the student to identify the false belief on their own, followed by an enduring belief update. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems annotated with RTs that are manually created or LLM-generated. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that large language and reasoning models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.
comment: 25 pages, 2 tables, 13 figures
♻ ☆ Big AI is accelerating the metacrisis: What can we do? ACL 2026
The world is in the grip of ecological, meaning, and language crises that are converging into a metacrisis. Big AI is accelerating them all. LLM engineering sits at the core. Despite the public good motives of language engineers and the promise of LLMs, this work is being leveraged to create unprecedented wealth and power for a handful of individuals and corporations while causing existential harm to life on earth. As a profession, we urgently need to come together to explore alternatives and to design a life-affirming future for our field of natural language processing that is centered on human flourishing on a living planet.
comment: 12 pages, 2 figures, to appear in Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), San Diego, July 2026
♻ ☆ Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, for accumulating supporting evidence across turns. Meanwhile, with reinforcement learning (RL), training these agents rely on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines that often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.
comment: Preprint
♻ ☆ Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents
Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen environments and distribution shifts, making continual learning in such environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded training data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores an environment to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 3-29% absolute performance gains on the target environments without catastrophic forgetting on others. We also show that it can mitigate performance degradation under environment changes (e.g., version updates, platform migration, and resolution shifts). Further analyses show highly sparse updates (e.g., only 20% parameters), which helps explain the effective and robust adaptation.
comment: 28 pages, 10 figures
♻ ☆ Demystifying When Pruning Works via Representation Hierarchies ICML 2026
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations
comment: ICML 2026. 24 pages, 21 figures, and 3 tables. Includes an appendix with supplementary experiments and derivations
♻ ☆ Modeling Narrative Structure in Latin Epic Poetry with Automatically Generated Story Grammars
Computational methods for analyzing prose and poetry utilize word embeddings and other abstract representations that sometimes obscure context-rich literary text. Inspired by the psychology of reading, we utilize story structure and elements to simulate human narrative comprehension to produce a more comprehensive representation of literary text. We present a method for automatically generating story grammar labels for input texts as a means of analysis that is interpretable and accessible by humanists and technologists alike. Using a large language model (LLM) pipeline and few-shot learning, we label Latin epic poetry with story element labels and use this output directly to aid an analysis of the story structure and style. Our method guides literary scholars to discover new areas of interest across texts and provides a new feature set for further study for downstream machine learning tasks.
comment: Submitted to Journal of Computational Literary Studies
♻ ☆ Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing ACL
Large language models (LLMs) have recently demonstrated success in decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the behavior of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing order of details, modifying access to few-shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied show that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime. Code is available at https://sites.google.com/illinois.edu/astllm.
comment: Accepted to ACL Findings 2026; 31 pages, 26 figures, 6 tables
♻ ☆ FLAME: A New Dataset on FLemish Accounts of Momentary Experiences
We introduce FLAME (FLemish Accounts of Momentary Experiences), a new corpus of nearly 25,000 daily personal narratives in Belgian-Dutch (Flemish), designed to support research on underrepresented language varieties in Natural Language Processing (NLP). Personal narratives of this kind hold rich potential for uncovering culturally grounded, everyday themes, yet extracting meaningful topics from such data is non-trivial, given the informal register, cultural specificity, and low-resource nature of the Flemish variety. We therefore ask: which topic modeling approach is best suited to reveal the latent themes in this corpus? To answer this, we benchmark three widely used methods: K-Means Clustering, Latent Dirichlet Allocation (LDA), and BERTopic, evaluating their ability to identify coherent and culturally relevant topics. While LDA achieves strong performance on automated coherence metrics, human evaluation reveals that BERTopic consistently produces the most coherent and culturally resonant topics, exposing the limitations of purely statistical methods on narrative-rich data. The diminished performance of K-Means compared to prior work on similar Dutch corpora further highlights the unique linguistic challenges posed by this dataset. Our findings demonstrate that contextual embeddings are critical for robust topic modeling in low-resource, culturally specific domains, and underscore the importance of human-centered evaluation alongside automated metrics.
♻ ☆ TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned specialized numeric tokens embedding; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
comment: Preprint
♻ ☆ Natural Language Processing in the Legal Domain
We summarize the current state of the field of NLP & Law with a specific focus on recent technical and substantive developments. To support our analysis, we construct and analyze a nearly complete corpus of nearly one thousand NLP & Law related papers published between 2013-2024. Our analysis highlights several major trends. Namely, we document an increasing number of papers written, tasks undertaken, and languages covered over the course of the past decade. We observe an increase in the sophistication of the methods which researchers deployed in this applied context. Legal NLP is beginning to match not only the methodological sophistication of general NLP but also the professional standards of data availability and code reproducibility observed within the broader scientific community. We believe all of these trends bode well for the future of the field and point to an exciting next phase for the Legal NLP community.
comment: 15 pages, 7 figures, 2 tables
♻ ☆ Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting
Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.
comment: The source code is available at https://github.com/zhiyichin/ICER
♻ ☆ BEExformer: A Fast Inferencing Binarized Transformer with Early Exits
Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. This aids in 21.30 times reduction in model size. The EE mechanism hinges on fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.27% and even improves accuracy by 3.22% by resolving the "overthinking" problem inherent in deep networks. Extensive evaluation through comparison with the SOTA methods and various ablations across nine datasets covering multiple NLP tasks demonstrates its Pareto-optimal performance-efficiency trade-off.
comment: This revised manuscript includes 18 pages, 6 figures, and 6 tables. Methodology and results sections have been improved for clarity and depth, incorporating additional comparisons, ablations, and new evaluation datasets. A few relevant references were added, and overall organization refined for better readability
♻ ☆ Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).
♻ ☆ Synthetic Function Demonstrations Improve Generation in Low-Resource Programming Languages LREC 2026
A key consideration when training an LLM is whether the target language is more or less resourced, for example English compared to Welsh, or Python compared to Excel. Typical training data for programming languages consists of real program demonstrations coupled with explanatory human-written comments. In this work we present a novel approach to the creation of such data for low resource programming languages, which lack naturally occurring data. Our process generates synthetic, textbook-quality demonstrations of how to use library functions, which we show makes for good model finetuning data. We demonstrate in an example domain of Excel Formulas. First, we collate language documentation, then we use this to augment a powerful teacher model which generates synthetic training data, and finally finetune student models on the demonstrations. Our technique improves student performance on 2 question-answering datasets: WikiTQ and TAT-QA. We also show advantages of finetuning over standard RAG approaches, which can offer only modest improvement due to the unfamiliarity of the target domain to student models.
comment: Published at LREC 2026
♻ ☆ How far can bias go? Tracing bias from pretraining data to alignment
As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding instruction-tuning partially alleviating representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
♻ ☆ Toxicity Detection Should Measure Contextual Harm, Not Text-Intrinsic Badness
Toxicity detection has become core safety infrastructure for online moderation, dataset filtering, and deployed language-model systems. Yet most detectors still treat toxicity as an intrinsic property of isolated text. This position paper argues that toxicity detection should be evaluated as the contextual measurement of situated communicative harm, rather than as single-label text classification. Toxicity is not contained in words alone; it emerges when a communicative act is interpreted by an audience within a normative and social context. We introduce the Contextual Stress Framework (CSF), which defines toxicity as a relation between perceived norm violation and induced stress or disruption. CSF explains why text-intrinsic detectors overflag dialectal or reclaimed language, miss coded or pragmatic abuse, and remain brittle under meaning-preserving transformations. We propose CSF-Eval, an evaluation agenda that separates text risk, norm violation, disruption, uncertainty, and policy action.
Machine Learning 300
☆ ELF: Embedded Language Flows
Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
comment: Tech Report. Project webpage: https://github.com/lillian039/ELF
☆ Variational Inference for Lévy Process-Driven SDEs via Neural Tilting
Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.
comment: The associated project page which contains the official implementation can be found in https://circle-group.github.io/research/NeuralTilting/
☆ DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00$\times$ speedup on real hardware compared with dense inference. Codes and checkpoints will be released.
comment: 14 pages, 11 figures, 11 tables
☆ Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{{\log(β+1)}/β}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $β^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\logβ$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $β$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
comment: 30 pages, 10 figures
☆ Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance or internalized into the policy, eventually leading to zero-skill inference. We argue this assumption is overly restrictive, since with limited parametric capacity and uneven marginal contribution across skills, the optimal active skill set is non-monotonic, task- and stage-dependent. In this work, we propose SLIM, a framework of dynamic Skill LIfecycle Management for agentic reinforcement learning (RL), which treats the active external skill set as a dynamic optimization variable jointly updated with policy learning. Specifically, SLIM estimates each active skill's marginal external contribution through leave-one-skill-out validation, then applies three lifecycle operations: retaining high-value skills, retiring skills whose contribution becomes negligible after sufficient exposure, and expanding the skill bank when persistent failures reveal missing capability coverage. Experiments show that SLIM outperforms the best baselines by an average of 7.1% points across ALFWorld and SearchQA. Results further indicate that policy learning and external skill retention are not mutually exclusive: some skills are absorbed into the policy, while others continue to provide external value, supporting SLIM as a more general paradigm for skill-based agentic RL.
comment: Implementation code is available at https://github.com/ejhshen/SLIM
☆ Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges ICML 2026
We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral $(\{0,1\})$ transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.
comment: Accepted in ICML 2026 as a spotlight paper
☆ Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis
We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.
☆ Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $||d_μ^{π^*} / d_μ^π||_\infty$ and $||d_μ^{π^*} / μ||_\infty$ enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.
☆ DataMaster: Towards Autonomous Data Engineering for Machine Learning
As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
☆ Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers
Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space: and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. These approaches combined, provide new insights on how effective Guardrail Classifiers really are, beyond traditional red-teaming.
☆ RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.
comment: 63 pages, 6 figures
☆ V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegràd Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.
☆ Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
☆ LoKA: Low-precision Kernel Applications for Recommendation Models At Scale ISCA'26
Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
comment: Accepted to ISCA'26
☆ Neural Weight Norm = Kolmogorov Complexity
Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.
☆ AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
comment: 22 pages
☆ Compute Where it Counts: Self Optimizing Language Models ICML'26
Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency pareto-front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
comment: Accepted at ICML'26 Code: https://github.com/akhauriyash/SOL
☆ Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography
This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.
☆ BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON ( Behavioral Engine for Authentication \& Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive \textit{Valorant} gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary \textit{Valorant} configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models
☆ Masked Generative Transformer Is What You Need for Image Editing CVPR 2026
Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
comment: CVPR 2026 HiGen Workshop; Project Page at https://weichow23.github.io/EditMGT/ GitHub at https://github.com/weichow23/EditMGT
☆ The Generalized Turing Test: A Foundation for Comparing Intelligence
We introduce the Generalized Turing Test (GTT), a formal framework for comparing the capabilities of arbitrary agents via indistinguishability. For agents A and B, we define the Turing comparator A $\geq$ B to hold if B, acting as a distinguisher, cannot reliably distinguish between interactions with A (instructed to imitate B) and another instance of B. This yields a dataset- and task-agnostic notion of relative intelligence. We study the comparator's structure, including conditions under which it is transitive and therefore induces an ordering over equivalence classes, and we define and analyze variants with querying, bounded interaction, and fixed distinguishers. To complement the theory, we instantiate the framework on a collection of modern models, empirically evaluating pairwise indistinguishability across thousands of trials. The resulting comparisons exhibit a stratified structure consistent with existing rankings, hinting that the proposed framework yields meaningful empirical orderings. Our results position indistinguishability as a unifying lens for reasoning about intelligence, suggesting a foundation for evaluation and, potentially, training objectives that are inherently independent of fixed datasets or benchmarks.
☆ Conditional anomaly detection methods for patient-management alert systems ICML-2008
Anomaly detection methods can be very useful in identifying unusual or interesting patterns in data. A recently proposed conditional anomaly detection framework extends anomaly detection to the problem of identifying anomalous patterns on a subset of attributes in the data. The anomaly always depends (is conditioned) on the value of remaining attributes. The work presented in this paper focuses on instance-based methods for detecting conditional anomalies. The methods rely on the distance metric to identify examples in the dataset that are most critical for detecting the anomaly. We investigate various metrics and metric learning methods to optimize the performance of the instance-based anomaly detection methods. We show the benefits of the instance-based methods on two real-world detection problems: detection of unusual admission decisions for patients with the community-acquired pneumonia and detection of unusual orders of an HPF4 test that is used to confirm Heparin induced thrombocytopenia - a life-threatening condition caused by the Heparin therapy.
comment: Published at Workshop on Machine Learning in Health Care Applications ICML-2008 - MLHealth
☆ Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
comment: 17 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA
☆ Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
comment: 13 pages, 7 figures
☆ SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
Large language models possess strong chemical reasoning capabilities, making them effective molecular editors. However, property-relevant information is implicitly entangled across their dense hidden states, providing no explicit handle for property control: a substantial fraction of edits fail to improve or even degrade target properties. To address these issues, we propose SLIM (Sparse Latent Interpretable Molecular editing), a plug-and-play framework that decomposes the editor's hidden states into sparse, property-aligned features via a Sparse Autoencoder with learnable importance gates. Steering in this sparse feature space precisely activates property-relevant dimensions, improving editing success rate without modifying model parameters. The same sparse basis further supports interpretable analysis of editing behavior. Experiments on the MolEditRL benchmark across four model architectures and eight molecular properties show consistent gains over baselines, with improvements of up to 42.4 points.
☆ Predicting 3D structure by latent posterior sampling
The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.
☆ NoRIN: Backbone-Adaptive Reversible Normalization for Time-Series Forecasting
Reversible instance normalization (RevIN) and its successors (Dish-TS, SAN, FAN) have become the de facto plug-in for time-series forecasting, yet the map they apply to each data point is strictly affine, $x \mapsto ax+b$, so they cannot reshape the underlying distribution -- heavy tails remain heavy and skewness remains uncorrected. We propose NoRIN, a non-linear reversible normalization based on the arcsinh-form Johnson $S_U$ transform with two shape parameters $(δ,\varepsilon)$ that control tailedness and skewness; the linear $Z$-score used by RevIN is recovered only in the limit $δ\to \infty$. Training $(δ,\varepsilon)$ jointly with the backbone via gradient descent reliably pushes them toward this linear limit within a few epochs -- a phenomenon we name the degeneration problem: the forecasting loss is locally indifferent to shape, and the high-capacity backbone compensates for any monotone reparameterization of its input. NoRIN escapes the degeneration by decoupling shape selection from gradient training: $(δ,\varepsilon)$ are initialized by a closed-form Slifker-Shapiro quantile fit and refined by Bayesian optimization on the validation objective, while the inner training loop is identical to standard RevIN-style training. Across six representative backbones x five real-world datasets x three prediction horizons (90 configurations), decoupled shape optimization recovers $(δ^\star,\varepsilon^\star)$ that sit systematically far from the linear limit, with values that vary in a backbone-dependent way. This empirically supports the central thesis: different backbones genuinely require different normalization parameters to reach their best performance.
comment: 8 pages, 2 figures
☆ Benchmarking Sensor-Fault Robustness in Forecasting
Cyber-physical system (CPS) forecasting models depend on sensor streams with noisy, biased, missing, or temporally misaligned readings, yet standard forecasting evaluation often selects models by nominal error without showing whether they remain robust under such faults. We introduce SensorFault-Bench, a shared CPS-grounded sensor-fault stress-test protocol for evaluating forecasting architectures and robustness-improvement methods, and an operational taxonomy organizing the method comparison. Across four real-world datasets and eight scored scenarios governed by a standardized severity model, it reports worst-scenario degradation, clean mean squared error (MSE), and worst-scenario fault-time MSE, separating relative robustness from absolute error. A disjoint fault-transfer split lets explicit fault-training methods train on adjacent fault families while evaluation uses separate benchmark scenarios. Empirically, forecasting architectures favored by clean MSE can degrade sharply under faults, and clean-MSE rankings can disagree with worst-scenario fault-time error rankings. Chronos-2, the evaluated zero-shot foundation-model representative, matches or trails the last-value naive forecaster in clean MSE on the two single-target datasets and has the largest worst-scenario degradation on ETTh1 and Traffic, where all channels are forecast targets. For the evaluated robustness-improvement method set, paired deltas show selective degradation reductions: projected gradient descent adversarial training and randomized training lead where value faults dominate observed degradation, while fault augmentation leads where availability faults dominate. SensorFault-Bench provides open-source code, documented data access, and reproduction and extension guides, so new datasets, architectures, and robustness-improvement methods can be evaluated under the same CPS sensor-fault robustness protocol.
☆ MaD Physics: Evaluating information seeking under constraints in physical environments
Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.
comment: 64 pages, 10 figures. Project page: https://mad-physics.github.io/
☆ On periodic distributed representations using Fourier embeddings
Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, result in difficulty processing and distinguishing nearby angles, especially when their absolute difference exceeds pi. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot product similarities, allowing us to construct a variety of different kernel shapes. In this work, we aim of highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally-plausible representation scheme of Spatial Semantic Pointers.
☆ Policy Gradient Methods for Non-Markovian Reinforcement Learning
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
comment: 39 pages, 5 figures, 1 table
☆ Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities
We introduce an automatically generated benchmark for predicting hidden text in technical papers. A paper supplies visible context $X$ and a hidden continuation $Y$; the evaluated model writes an auxiliary forecast string $Z$, and a separate scorer assigns next-token probability to $Y$ both with and without conditioning on $Z$. This gives a label-free test of whether $Z$ transmits information about the continuation, compared against controls where $Z$ is recent context rather than a forecast. Our main testbed is equation-suffix prediction: the predictor sees context and the first part of a displayed equation, then forecasts the rest. The task mixes surface-level arXiv/TeX text modeling with reasoning-sensitive inference; the suffix is one of many roughly equivalent continuations, so the benchmark is read statistically rather than item-by-item. On 1363 equation continuations from 138 recent physics and mathematics papers, forecasts from GPT-5.5, Opus 4.7, and GPT-5.4 nano all improve clipped likelihood over the context control under both Qwen3-8B and Kimi K2.6 scorers, distinguishing model families and reasoning-effort settings without human labels. To emulate shortcuts where $Z$ further primes the scorer rather than making a useful forecast, we also fine-tune the scorer on context-only prompts and apply it to held-out papers as a stronger control. GPT-5.5 forecasts still beat this fine-tuned control; GPT-5.4 nano forecasts do not. Longer prose/TeX continuations show positive but noisier lift over controls, concentrated near the beginning of the target. These results support cross-model likelihood scoring as a static benchmark and as a setup for probing shortcut vulnerabilities before reinforcement learning or model-selection optimization is applied.
comment: 13 pages + appendices, 4 figures
☆ Mistake-Bounded Language Generation
We investigate the learning task of language generation in the limit, but shift focus from the traditional time-of-last-mistake metric of a generator's success to a new notion of "mistake-bounded generation." While existing results for language generation in the limit focus on guaranteeing eventual consistency, they are blind to the cumulative error incurred during the learning process. We address this by shifting the goal to minimizing the total number of invalid elements output by a generation algorithm. We establish a formal reduction to the Learning from Correct Demonstrations framework of Joshi et al. (2025), enabling a general recipe for deriving mistake bounds via weighted update rules. For finite classes, we provide an algorithm that simultaneously achieves an optimal last-mistake time of $\mathsf{Cdim}(L)$ and a mistake bound of $\lfloor \log_2 |L| \rfloor$, whereas for the non-uniform setting of countably infinite streams of languages, we prove a fundamental trade-off: achieving logarithmic mistakes $O(\log i)$ necessarily precludes convergence guarantees established in prior work. Finally, we show that our framework can be extended to accommodate noisy adversaries and guarantee mistake bounds that scale with the adversary's suboptimality.
☆ LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges IEEE
The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high-level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in-depth analysis of the state-of-the-art in LLM-driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs -- from reasoning-driven synthesis and multi-agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red-teaming for robust security assessment. Finally, we synthesize cross-cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.
comment: Accepted for 2026 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
☆ PhyGround: Benchmarking Physical Reasoning in Generative World Models
Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.
comment: Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/
☆ The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies NeurIPS 2026
Corruption studies, the primary tool for evaluating chain-of-thought (CoT) faithfulness, identify which chain positions are "computationally important" by measuring accuracy when steps are replaced with errors. We identify a systematic confound: for chains with explicit terminal answer statements, the dominant format in standard benchmarks, corruption studies detect where the answer text appears, not where computation occurs. A within-dataset format ablation provides the key evidence: on standard GSM8K chains ending with "the answer is X," removing only the answer statement, preserving all reasoning, collapses suffix sensitivity ~19x at 3B (N=300, p=0.022). Conflicting-answer experiments quantify the causal mechanism: at 7B, CC accuracy drops to near-zero (<=0.02) across five architecture families; the followed-wrong rate spans 0.63-1.00 at 3B-7B and attenuates at larger scales (0.300 at Phi-4-14B, ~0.01 at 32B). A within-stable 7B replication (9.3x attenuation, N=76, p=7.8e-3; Qwen3-8B N=299, p=0.004) provides converging evidence, and the pattern replicates on MATH (DeepSeek-R1-7B: 10.9x suffix-survival recovery). On chains without answer suffixes the same protocol identifies the prefix as load-bearing (Delta=-0.77, p<10^-12). Generation-time probes confirm a dissociation: the answer is not early-determined during generation (early commitment <5%), yet at consumption time model outputs systematically follow the explicit answer text. The format-determination effect persists through 14B (8.5x ratio, p=0.001) and converges toward zero at 32B. We propose a three-prerequisite protocol (question-only control, format characterization, all-position sweep) as a minimum standard for corruption-based faithfulness studies.
comment: 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT
☆ Muown: Row-Norm Control for Muon Optimization
Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.
☆ Factual recall in linear associative memories: sharp asymptotics and mechanistic insights
Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding~$d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces~$p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
☆ ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs
Large language models (LLMs) are costly to deploy due to their large memory footprint and high inference cost. Weight-activation quantization can reduce these costs, but low-bit activation quantization remains difficult because activation outliers induce large quantization error. Recent rotation-based methods address this by applying orthogonal transformations that redistribute activation magnitude across dimensions, but existing approaches either require expensive end-to-end rotation training or rely on stored activation corpora, introducing significant compute or storage overhead. We propose a lightweight post-training rotation calibration method for LLM activation quantization. Our method learns orthogonal rotations that align normalized activations with the corners of an inscribed hypercube, encouraging activation energy to be distributed more evenly across dimensions. This objective admits an efficient closed-form update via the orthogonal Procrustes problem, avoiding gradient-based optimization over the orthogonal group. We further introduce an online calibration procedure that updates rotations as calibration samples are processed, eliminating the need to store activations on disk and allowing rotations to adapt to quantized activation distributions during calibration. Experiments on Llama-2 and Llama-3 models from 3B to 70B parameters show that our method achieves competitive or improved performance across perplexity benchmarks and common sense reasoning tasks while avoiding both costly end-to-end training and large offline activation storage.
☆ Fixed-Point Neural Optimal Transport without Implicit Differentiation
We propose an implicit neural formulation of optimal transport that eliminates adversarial min--max optimization and multi-network architectures commonly used in existing approaches. Our key idea is to parameterize a single potential in the Kantorovich dual and reformulate the associated c-transform as a proximal fixed-point problem. This yields a stable single-network framework in which dual feasibility is enforced exactly through proximal optimality conditions rather than adversarial training. Despite the inner fixed-point computation, gradients can be computed without differentiating through the fixed-point iterations, enabling efficient training without requiring implicit differentiation. We further establish convergence of stochastic gradient descent. The resulting framework is efficient, scalable, and broadly applicable: it simultaneously recovers forward and backward transport maps and naturally extends to class-conditional settings. Experiments on high-dimensional Gaussian benchmarks, physical datasets, and image translation tasks demonstrate strong transport accuracy together with improved training stability and favorable computational and memory efficiency.
comment: 37 pages, submitted to SIAM Journal on Mathematical Data Science (currently under review)
☆ Elucidating Representation Degradation Problem in Diffusion Model Training
Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.
☆ MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
Multi-negative preference optimization under the Plackett--Luce (PL) model extends Direct Preference Optimization (DPO) by leveraging comparative signals across one preferred and multiple rejected responses. However, optimizing over large negative pools is costly, and many candidates contribute redundant gradients due to their similar effects on policy updates. We introduce MASS-DPO, a multi-negative active sample selection method that derives a PL-specific Fisher-information objective for selecting compact, informative negative subsets within each prompt. The resulting log-determinant objective selects negatives that contribute complementary information for policy updates, yielding compact subsets that retain the full pool's information while reducing redundancy. In practice, this favors negatives whose gradients cover different update directions, reducing redundant signal from near-duplicate candidates while preserving the most useful training information. Across four benchmarks spanning recommendation and multiple-choice QA and three model families, MASS-DPO consistently exceeds or matches existing methods in accuracy, improves Recall/NDCG and margin-based optimization dynamics, and delivers stronger alignment with substantially fewer negatives.
☆ Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses it's own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
☆ Locking Pretrained Weights via Deep Low-Rank Residual Distillation
The quality of open-weight language models has dramatically improved in recent years. Sharing weights greatly facilitates model adoption by enabling their use across diverse hardware and software platforms. They also allow for more open research and testing, to the extent that users can use them as checkpoints, fine-tune them according to their needs, and potentially redistribute them. In some cases, however, concerns on modifying these weights towards unauthorized uses may outweigh the pros of giving users such a freedom. Defending against such adaptation is non-trivial: since an adaptive attacker can observe all weights and architectures by definition, they can reverse simple structural defenses, and use optimization to defeat the simplest locking mechanisms. In this work, we exploit the inference-training asymmetry of automatic differentiation as a novel defense axis. We propose DLR-Lock, a method where the purveyor of the model purposely replaces each pretrained MLP in their model with a deep low-rank residual network (DLR-Net) of comparable parameter count, forcing activation memory that grows linearly with depth during backpropagation. DLR-Nets are efficiently trained via module-wise distillation. We show that, beyond this memory overhead, DLR-Lock results in architectural mismatches that complicate the optimization landscape of standard fine-tuning, and a backward pass that incurs disproportionately more overhead than the forward pass. Our defense succeeds in withstanding adaptive attackers with full knowledge of the defense strategy while preserving the original model's capabilities. Experiments on LLM validate these claims.
☆ On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the many hidden neurons or attention heads limit, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers corresponds to the construction of an ``escaping active set'' -- we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness and the stability with respect to discretization of the mean field training dynamic for sub-Gaussian initializations.
☆ DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
Multi-domain fine-tuning of large language models requires improving performance on target domains while preserving performance on constrained domains, such as general knowledge, instruction following, or safety evaluations. Existing data mixing strategies rely on fixed heuristics or adaptive rules that cannot explicitly enforce preservation of such capabilities. We propose DynaMiCS, a dynamic mixture optimizer that casts multi-domain fine-tuning as a constrained optimization problem. At each update, DynaMiCS performs short domain-specific probing runs to estimate a slope matrix of local cross-domain effects, capturing how training on each fine-tuning dataset affects each evaluation domain. These estimates are then used to compute mixture weights through optimization over the probability simplex, with the objective of improving target-domain performance while keeping constrained-domain losses below reference levels. Across multi-domain fine-tuning scenarios with varying numbers of target and constrained domains, DynaMiCS achieves stronger target-domain improvements and higher constraint satisfaction than fixed-mixture baselines, at lower computational cost and without reference models, per-example scoring, or manually tuned mixture weights.
☆ Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
☆ Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Diffusion and flow-matching models scale because pretraining is supervised regression: a clean sample is noised analytically, and a model regresses against a closed-form target. RL post-training aligns the model with a reward. In image generation, this makes samples compose objects correctly, render text legibly, and match human preferences. Existing methods rely on costly SDE rollouts, reward gradients, or surrogate losses, sacrificing pretraining's regression structure. We show that the structure extends to RL post-training. Under KL-regularized reward maximization, the optimal generative process tilts the clean-endpoint distribution towards samples with higher reward and leaves the noising law unchanged. Combining this with the adjoint-matching optimality condition and a REINFORCE identity, we derive Reinforce Adjoint Matching (RAM): a consistency loss that corrects the pretraining target with the reward. At each step, we draw a clean endpoint from the current model, evaluate its reward, noise it as in pretraining, and regress. No SDE rollouts, backward adjoint sweeps, or reward gradients are required. Like the pretraining objective, RAM is simple and scales. On Stable Diffusion 3.5M, RAM achieves the highest reward on composability, text rendering, and human preference, reaching Flow-GRPO's peak reward in up to $50\times$ fewer training steps.
☆ Provable Sparse Inversion and Token Relabel Enhanced One-shot Federated Learning with ViTs
One-Shot Federated Learning, where a central server learns a global model in a single communication round, has emerged as a promising paradigm. However, under extremely non-IID settings, existing data-free methods often generate low-quality data that suffers from severe semantic misalignment with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by fully exploiting all patches of synthetic images. Specifically, FedMITR employs sparse model inversion during data generation, selectively inverting semantic foregrounds while halting the inversion of uninformative backgrounds. To address semantically meaningless tokens that hinder ViT predictions, we implement a differentiated strategy: patches with high information density utilize generated pseudo-labels, while patches with low information density are relabeled via ensemble models for robust distillation. Theoretically, our analysis based on algorithmic stability reveals that Sparse Model Inversion eliminates gradient instability arising from background noise, while Token Relabel effectively reduces gradient variance, collectively guaranteeing a tighter generalization bound. Empirically, extensive experimental results demonstrate that FedMITR substantially outperforms existing baselines under various settings.
comment: 18 Pages
☆ AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery
Fine-tuning large language models with LoRA requires choosing a rank r before training starts. Existing approaches either extract rank-1 components sequentially, freezing each component's error permanently into every subsequent residual, or optimize the full low-rank factorization jointly with guarantees that describe only the joint update, not individual rank-1 directions. We present AdaPaD (Adaptive Parallel Deflation), which trains all rank-1 components simultaneously: each worker refines its component against a deflation target built from the latest estimates of all predecessors, and as those estimates improve, the targets improve too. We call this property self-correction: deflation errors converge to zero over rounds rather than persisting as fixed residuals. On top of this backbone, AdaPaD adds advance learning (private pre-training before activation) and per-module dynamic rank discovery (importance-based growth until a shared budget is exhausted), making the rank distribution an output rather than an input. We prove that every component's error decays exponentially after a warm-up period, with a generalization bound that splits into a vanishing algorithmic term and an irreducible statistical floor. Empirically, AdaPaD is competitive with adaptive-rank LoRA baselines on GLUE with DeBERTaV3-base at matched parameter budgets, and competitive with fixed-rank LoRA on Qwen3-0.6B SQuAD/SQuAD v2 while deploying an adapter that is on average 30.7% smaller.
☆ XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies
For reinforcement learning in the real world online exploration is expensive A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards While prior data is used to augment experience and pretrain models we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting due to a failure to use pretrained policies effectively We propose XQCfD which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers pretrained policies and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy like prior works We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher entropy predictions XQCfD achieves state of the art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit Robomimic and MimicGen benchmarks -- notably with a low update-to-data ratio and no ensemble networks
comment: 22 pages, 10 figures, 2 tables
☆ Kernel-Gradient Drifting Models
We propose kernel-gradient drifting, a one-step generative modeling framework that replaces the fixed Euclidean displacement direction in drifting models with directions induced by the kernel itself. Standard drifting is attractive because it enables fast, high-quality generation without distilling a large pretrained diffusion model, but its theory is currently understood mainly for Gaussian kernels, where the drift coincides with smoothed score matching and is identifiable. Our gradient-based reformulation exposes this score-based structure for general kernels: the resulting drift is the score difference between kernel-smoothed data and model distributions, yielding identifiability for characteristic kernels and a smoothed-KL descent interpretation of the drifting dynamics. Since kernel gradients are intrinsic tangent vectors, the same construction extends naturally to Riemannian manifolds and to discrete data via the Fisher-Rao geometry of the probability simplex. Across spherical geospatial data, promoter DNA and molecule generation, kernel-gradient drifting enables state-of-the-art one-step generation beyond the Euclidean setting without distillation.
☆ AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
Generating long-horizon music videos (MVs) is frequently constrained by prohibitive computational costs and difficulty maintaining cross-shot consistency. We propose AllocMV, a hierarchical framework formulating music video synthesis as a Multiple-Choice Knapsack Problem (MCKP). AllocMV represents the video's persistent state as a compact, structured object comprising character entities, scene priors, and sharing graphs, produced by a global planner prior to realization. By estimating segment saliency from multimodal cues, a group-level MCKP solver based on dynamic programming optimally allocates resources across High-Gen, Mid-Gen, and Reuse branches. For repetitive musical motifs, we implement a divergence-based forking strategy that reuses visual prefixes to reduce costs while ensuring motif-level continuity. Evaluated via the Cost-Quality Ratio (CQR), AllocMV achieves an optimal trade-off between perceived quality and resource expenditure under strict budgetary and rhythmic constraints.
☆ On Improving Graph Neural Networks for QSAR by Pre-training on Extended-Connectivity Fingerprints
Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies; yet, their superiority compared to classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. Across five out of six Biogen benchmarks, we observed a statistically significant improvement in standard performance metrics over all evaluated baselines when using ECFP pre-trained GNNs. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. Importantly, we investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.
☆ An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either due to a lack of causal awareness or acting under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate, while maintaining 62.0% repair accuracy and a 3ms mean time to repair.
☆ Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling IEEE
Multi-agent trajectory modeling traditionally focuses on forecasting, often neglecting more general tasks like trajectory completion, which is essential for real-world applications such as correcting tracking data. Existing methods also generally predict agents' states without offering any state-wise measure of heteroscedastic uncertainty. Moreover, popular multi-modal sampling methods lack error probability estimates for each generated scene under the same prior observations, which makes it difficult to rank the predictions at inference time. We introduce U2Diffine, a unified diffusion model built to perform trajectory completion while simultaneously offering state-wise heteroscedastic uncertainty estimates. This is achieved by augmenting the standard denoising loss with the negative log-likelihood of the predicted noise, and then propagating the latent space uncertainty to the real state space using a first-order Taylor approximation. We also propose U2Diff, a faster baseline that avoids gradient computation during sampling. This approach significantly increases inference speed, making it as efficient as a standard generative-only diffusion model. For post-processing, we integrate a Rank Neural Network (RankNN) that enables error probability estimation for each generated mode, demonstrating strong correlation with ground truth errors. Our method outperforms state-of-the-art solutions in both trajectory completion and forecasting across four challenging sports datasets (NBA, Basketball-U, Football-U, Soccer-U), underscoring the effectiveness of our uncertainty and error probability estimation.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)
☆ What should post-training optimize? A test-time scaling law perspective
Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.
☆ Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data ICLR 2026
We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for $(n_1, n_2)$ to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely agnostic to the quality of the data, it is uniformly bounded, and in particular one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and only depends on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
comment: Published as a conference paper at ICLR 2026
☆ RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
We present a new class of efficient attention mechanisms applying universal 3D Relative Positional Encoding (RPE) methods given by arbitrary integrable modulation functions $f$. They lead to the new class of 3D-Transformer models, called \textit{RelFlexformers}, flexibly integrating those RPEs, and characterized by the $O(L \log L)$ time complexity of the attention computation for the $L$-length input sequences. RelFlexformers builds on the theory of the Non-Uniform Fourier Transform (NU-FFT), naturally generalizing several existing efficient RPE-attention methods from structured settings with tokens homogeneously embedded in unweighted grids into general non-structured heterogeneous scenarios, where tokens' positions are arbitrarily distributed in the corresponding 3D spaces. As such, RelFlexformers can be applied in particular to model point clouds. Our extensive empirical evaluation on a large portfolio of 3D datasets confirms quality improvements provided by the NU-FFT-driven attention modulation techniques in the RelFlexformers.
☆ DANCE: Detect and Classify Events in EEG
Event identification in continuous neural recordings is a critical task in neuroscience. Decoding in EEG is dominated by classifying windows aligned to known event onsets. However, while available in controlled experiments, such onsets are absent in continuous real-world monitoring. Here, we introduce DANCE, a deep learning pipeline that frames neural decoding as a set-prediction problem and jointly detects and classifies events directly from raw, unaligned signals. Evaluated separately on ten datasets curated from the literature with a wide variety of event types (ranging from milliseconds to minutes in duration), our model outperforms existing methods on a broad range of cognitive, clinical and BCI tasks. This single architecture establishes a new state of the art in the competitive task of seizure monitoring and matches the accuracy of onset-informed models for BCI tasks. Overall, our method marks a step towards end-to-end asynchronous neural decoding models
comment: 29 pages
☆ The finite expression method for turbulent dynamics with high-order moment recovery
Turbulent dynamical systems are characterized by nonlinear interactions and stochastic effects that generate coupled statistical quantities, such as non-zero higher-order moments, which are difficult to capture from data with accuracy. We propose a two-stage data-driven modeling framework that combines symbolic regression with generative models to jointly identify the governing dynamics and predict their key statistical quantities. In Stage I of the framework, the Finite Expression Method (FEX) is adopted to discover closed-form expressions of the deterministic dynamics, recovering nonlinear interaction terms and external forcing without predefined libraries. In Stage II, generative models are introduced to learn the residual stochastic components as a refined correction to the model error from the Stage I approximation, enabling accurate characterization of higher-order statistics. Theoretical analysis establishes the consistency of the symbolic estimator and quantifies the estimation error in terms of data size and numerical discretization. The model performance is verified through detailed numerical experiments on the stochastic triad models across multiple regimes, demonstrating that the framework successfully recovers interaction terms and forcing expressions, and accurately predicts statistical moments up to order five. These results highlight the potential of integrating interpretable symbolic discovery with data-driven stochastic modeling for complex turbulent systems.
comment: 20 pages, 8 figures, 1 table
☆ Is Data Shapley Not Better than Random in Data Selection? Ask NASH ICML-26
Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
comment: Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper
☆ Scalable Mamba-Based Message-Passing Neural Decoder for Error-Correcting Codes IEEE
Forward error correction is essential for reliable communication over noisy channels. Attention-based model-free neural decoders have shown strong performance for short codes, but their scalability to longer codes is limited by the quadratic memory and computational cost of attention. In this paper, we introduce the Mamba message-passing decoder (MMPD), an attention-free syndrome-based neural decoder for binary linear codes. MMPD retains the Tanner-graph structure of a message-passing decoder by performing local pairwise aggregation along variable-check edges. To enable efficient long-range information propagation, these local updates are combined with bidirectional Mamba state-space blocks. By avoiding dense attention matrices, MMPD scales more favorably for long codes in both memory and computation. Experiments on the (1056, 880) LDPC code show that MMPD achieves a 0.45 dB gain over the state-of-the-art CrossMPT decoder at a specified target bit error rate, while reducing memory consumption by a factor of 1.5. This reduction factor increases substantially for longer codes, demonstrating the applicability of MMPD to scalable neural decoding of practical long codes.
comment: This work has been submitted to the IEEE for possible publication
☆ Exact Unlearning from Proxies Induces Closeness Guarantees on Approximate Unlearning
This paper proposes a paradigm shift linking machine unlearning directly to the structure of the data distributions rather than a mere update of the neural network parameters. We show that inferring these distributions with precision enables distilling the exact unlearning signal induced by the modeling. Theoretical bounds on the Kullback-Leibler divergence from the ideal retrained model to our unlearned model, under verifiable admissibility criterion, reveal the soundness of our framework. This method is experimentally validated over three forgetting scenarios as reaching the closest classifier to the ideal retrained model when compared to competitors.
☆ Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
During MLLM decoding, attention often abnormally concentrates on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention to compel focusing on key image information, we argue these tokens are critical carriers of visual and narrative logic, and such coercive corrections exacerbate visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
☆ Step Rejection Fine-Tuning: A Practical Distillation Recipe
Rejection Fine-Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE-bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they form a large portion of all trajectories for hard tasks and even then may be partially correct. In this work, we propose Step Rejection Fine-Tuning (SRFT) - a practical way to leverage these unresolved trajectories. For this, we employ a critic LLM to assess the correctness of each step in a trajectory. Consequently, during training, we mask the loss for erroneous steps while retaining them in the context window. This way we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE-bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding completely, reaching the total resolution rate of 32.2%.
☆ Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization
Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as $Q = φ^{-1} \circ U \circ φ$. CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils $z \pm Δr$ in $z = φ(x)$, maps endpoints back through $φ^{-1}$, and updates in $z$. Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a $Δ^2/μ^2$ residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.
☆ Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-γ)^{-1}\log((1-γ)^{-1}ε^{-1}))$ for computing an $ε$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
☆ A Spectral Framework for Closed-Form Relative Density Estimation
We propose a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. This is achieved by representing the Kullback-Leibler (KL) divergence as an integral of weighted chi-squared divergences, converting KL estimation into a family of least-squares problems. We derive an explicit spectral formula based only on first- and second-order feature moments, yielding closed-form estimators of both divergences and log-density potentials for fixed features. The framework extends to a broad class of f-divergences and can be combined with kernelization or feature learning with neural networks. We prove convergence guarantees for the resulting estimators and empirically compare them on synthetic data with optimization-based variational formulations, including logistic and softmax regression for normalized conditional models.
☆ Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory
Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO--ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability--plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.
☆ BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization
Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature $T$, producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code as $T \to 0$, and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU ($6.57\times$ speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive empirical result on Llama-3.2-1B at 2 bpw under end-to-end forward-KL distillation: with the right schedule (skip the high-$T$ phase to avoid an overshoot we diagnose), single-layer BCJR-QAT beats QTIP-PTQ by $\mathbf{-0.084}$ PPL on WikiText-2, and multi-layer compounding is super-additive.
comment: 26 pages, 4 figures, 4 tables. Code at https://github.com/Venugopalan2610/quant-2bit. Model weights and trajectory snapshots at https://huggingface.co/Venugopalan2610/BCJR-QAT-Llama-3.2-1B-2bit
☆ Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights
We consider the active learning problem where the goal is to learn an unknown function with low prediction error under an unknown Boltzmann distribution induced by the function itself. This self-induced weighting arises naturally in problems such as potential energy surface (PES) modeling in computational chemistry, yet poses unique challenges as the target distribution is unknown and its partition function is intractable. We propose \texttt{AB-SID-iVAR}, a Gaussian Process-based acquisition function that approximates the intractable Bayesian target distribution in closed form while avoiding partition function estimation, and is applicable to both discrete and continuous input domains. We also analyze a Thompson sampling alternative (\texttt{TS-SID-iVAR}) as a higher variance Monte Carlo variant. Despite the unknown target, under mild conditions, we establish that the terminal prediction error vanishes with high probability, and provide a tighter average-case guarantee. We demonstrate consistent improvements over existing approaches in this setting on synthetic benchmarks and real-world PES modeling and drug discovery tasks.
☆ A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables
Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.
☆ A Random-Matrix Criterion for Initializing Gated Recurrent Neural Networks
Proper weight initialization prior to training has historically been one of the key factors that helped kick off the deep learning revolution. Initialization is even more crucial in "reservoir computing", where the weights of a readout layer are learned linearly while the reservoir weights are fixed and largely determine the richness, stability and memory of the resulting dynamics. In the infinite-width limit it has been shown that meaningful initializations are those sitting at an effective critical point of the randomly initialized model. The phase transition is controlled by the weight variance $g^2$ and separates an ordered phase from a chaotic one where information progressively degrades. Here we derive a simple criterion to estimate the critical $g_c$ for a broad class of recurrent architectures and we show that it closely tracks the gain at which a gated-RNN reservoir achieves peak performance on a chaotic forecasting task. Finally, we argue that our criterion can serve as a design principle for future initialization schemes.
comment: 10 pages, 5 figures, 2 appendices
☆ A Single-Layer Model Can Do Language Modeling
Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.
comment: 9 pages, 5 figures, 1 table. Code: https://github.com/steve-z-wang/grounded-prediction-network
☆ Composing diffusion priors with explicit physical context via generative Gibbs sampling
Pretrained diffusion models provide powerful learned priors, but in scientific sampling the target distribution often depends on physical context that is not fully represented by one generative model. We introduce Generative Gibbs for Physics-Aware Sampling (GG-PA), a training-free framework that formulates the composition of learned partial priors and explicit physical context as inference over a joint target distribution in an augmented state space. We derive a Gibbs sampler for this joint target, show that it is asymptotically exact as the diffusion time approaches zero, and prove that in settings with quadratic interactions it remains exact at finite diffusion times. We further introduce replica exchange over diffusion time to accelerate mixing. Experiments on a double-well system, a $φ^4$ lattice model, and atomistic peptide systems show that GG-PA recovers context-induced distribution shifts and emergent collective behavior in interacting systems using partial priors without retraining. These results demonstrate GG-PA as a practical approach for combining pretrained generative priors with explicit physical context.
comment: 31 pages, 11 figures
☆ Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control
Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush--Kuhn--Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53\% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2--3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32--37\% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.
☆ Hierarchical End-to-End Taylor Bounds for Complete Neural Network Verification
Reachability analysis of neural networks, which seeks to compute or bound the set of outputs attainable over a given input domain, is central to certifying safety and robustness in learning-enabled physical systems. Since exact reachable set computation is generally intractable, existing methods typically rely on tractable overapproximations. Examining the state of the art for smooth, twice-differentiable networks, we observe that existing approaches exploit at most second-order information and do not systematically leverage higher-order information. In this work, we introduce \textsc{HiTaB}, a novel verification framework that exploits second-order smoothness through both the Hessian, $\nabla^2 f$, and its Lipschitz constant, $L_{\nabla^2 f}$. We further develop a unified hierarchy of zeroth-, first-, and second-order bounds, together with precise conditions under which higher-order approximations yield provable improvements. Our main technical contribution is a compositional procedure for efficiently bounding $L_{\nabla^2 f}$ in deep neural networks via layerwise propagation of curvature bounds. We extend the framework to both $\ell_2$- and $\ell_\infty$-constrained input sets and show how it can be integrated into branch-and-bound verification pipelines. To our knowledge, this is the first practical reachability analysis framework for smooth neural networks that systematically exploits Lipschitz continuity of curvature, leading to tighter and more informative safety certificates.
☆ MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.
☆ Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality
We introduce a technique that enables Neural-ODEs to approximate arbitrary velocity fields with a priori planted fixed-points. Specifically, a recipe is given to explicitly accommodate for a finite collection of points in the reference multi-dimensional space of the Neural-ODE where the velocity field is exactly equal to zero. In this way, the gradient-based training is rigorously constrained inside the prescribed hypothesis class while leaving the expressive power of the Neural-ODE unaltered. We rigorously prove the universality of the Neural-ODE under any local constraints in the velocity field and give a computationally convenient way of imposing the fixed points. Our method is then tested on two paradigmatic physical models.
comment: 15 pages, 3 figures
☆ Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science
Graph neural networks are increasingly adopted in trigger systems for collider experiments, where strict latency and throughput constraints render deployment on embedded platforms challenging. As detectors move towards higher granularity, the number of inputs per inference increase and FPGA-only solutions face resource bottlenecks. This work presents an end-to-end demonstrator for the real-time deployment of a dynamic Graph Neural Network for the Belle II electromagnetic calorimeter hardware trigger on the AMD Versal VCK190, leveraging both FPGA fabric and AI Engine tiles. We develop a Python-based semi-automated design flow covering operator fusion, partitioning, mapping, spatial parallelization, and kernel-level optimization. Our design achieves a throughput of 2.94 million events per second at an end-to-end latency of 7.15 microseconds. Compared to the FPGA-only baseline, this represents a 53% throughput improvement while reducing DSP utilization from 99% to 19% at 29% AI Engine tile utilization. To validate the deployment, an interactive visualization pipeline enables real-time monitoring of inference results on the physical demonstrator.
comment: Accepted to FCCM Reconfigurable Computing Challenge 2026
☆ Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems
Designing fair algorithmic decision systems requires balancing model performance with fairness toward affected individuals: More fairness might require sacrificing some performance and vice versa, yet the space of possible trade-offs is still poorly understood. We investigate fairness in binary prediction-based decision problems by conceptualizing decision making as a multi-objective optimization problem that simultaneously considers decision-maker utility and group fairness. We investigate the set of Pareto-optimal decision rules for arbitrary utility functions for decision maker, arbitrary population distributions, and a wide range of group fairness metrics. We find that the Pareto frontier consists of deterministic, group-specific threshold rules applied to individuals' success probability. This complements existing optimality theorems from literature which, for specific fairness constraints, posit lower-bound threshold rules only. However we also show that, depending on the used fairness metric, the Pareto frontier may include upper-bound threshold rules, thus preferring individuals with lower success probabilities. We show that the location of the Pareto frontier depends only on population characteristics, utility functions and fairness score, but not on the technical design of the algorithm - our findings hold for pre-, in-, and post-processing approaches alike. Our results generalize existing optimality theorems for fairness-constrained classification and extend them to generalized fairness metrics and fairness principles, and to partial fairness regimes. This paper connects formal fairness research with legal and ethical requirements to search for less discriminatory alternatives, offering a principled foundation for evaluating and comparing algorithmic decision systems.
comment: 23 pages, The 2026 ACM conference on Fairness, Accountability, and Transparency (FAccT'26)
☆ A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge IJCAI
Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator (https://riwwer.demo.calgo-lab.de) that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).
comment: 3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: https://riwwer.demo.calgo-lab.de
☆ Amortizing Causal Sensitivity Analysis via Prior Data-Fitted Networks
Causal sensitivity analysis aims to provide bounds for causal effect estimates in the presence of unobserved confounding. However, existing methods for causal sensitivity analysis are per-instance procedures, meaning that changes to the dataset, causal query, sensitivity level, or treatment require new computation. Here, we instead present an in-context learning approach. Specifically, we propose an amortized approach to causal sensitivity analysis based on prior-data fitted networks. A key challenge is that the sensitivity bounds are not directly available when sampling training data. To address this, we develop a general prior-data construction that is applicable across the class of generalized treatment sensitivity models. Our construction involves a Lagrangian scalarization of the objective to generate training labels for the bounds through a tradeoff between causal effect min/max-imization and sensitivity model violation, which avoids model-specific analytical derivations. We further show that, under standard convexity and linearity conditions, our objective recovers the full Pareto frontier of solutions. Empirically, we demonstrate our amortized approach across various datasets, causal queries, and sensitivity levels, where our approach achieves a test-time computation that is orders of magnitude faster than per-instance methods. To the best of our knowledge, ours is the first foundation model for in-context learning for causal sensitivity analysis.
☆ Controllability in preference-conditioned multi-objective reinforcement learning
Multi-objective reinforcement learning (MORL) allows a user to express preference over outcomes in terms of the relative importance of the objectives, but standard metrics cannot capture whether changes in preference reliably change the agent's behavior in the intended way, a property termed controllability. As a result, preference-conditioned agents can score well on standard MORL metrics while being insensitive to the preference input. If the ability to control agents cannot be reliably assessed, the symbolic interface that MORL provides between user intent and agent behavior is broken. Mainstream MORL metrics alone fail to measure the controllability of preference-conditioned agents, motivating a complementary metric specifically designed to that end. We hope the results spur discussion in the community on existing evaluation protocols to consolidate advances in preference adaptation in MORL to larger and more complex problems.
☆ Acceptance Cards:A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.
☆ Online Sharp-Calibrated Bayesian Optimization
Bayesian optimization (BO) is a widely used framework for optimizing expensive black-box functions, commonly based on Gaussian process (GP) surrogate models. Its effectiveness relies on uncertainty quantification that is both sharp (informative) and well-calibrated along the BO trajectory. In practice, GP kernel hyperparameters are unknown and are refit online from sequentially collected (non-i.i.d.) data, which can yield miscalibrated or overly conservative uncertainty and lies outside the fixed-kernel assumptions of standard BO regret theory. We propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), a BO algorithm that adaptively balances GP sharpness and calibration by casting hyperparameter selection as a constrained online-learning problem. We also show that OSCBO preserves sublinear regret bounds by leveraging the theoretical guarantees of the underlying online learning algorithm. Empirically, OSCBO performs competitively across synthetic and real-world benchmarks, ranking among the strongest methods in final simple regret while maintaining robust cumulative-regret behavior.
☆ Affine Tracing: A New Paradigm for Probabilistic Linear Solvers
Probabilistic linear solvers (PLSs) return probability distributions that quantify uncertainty due to limited computation in the solution of linear systems. The literature has traditionally distinguished between Bayesian PLSs, which condition a prior on information obtained from projections of the linear system, and probabilistic iterative methods (PIMs), which lift classical iterative solvers to probability space. In this work we show this dichotomy to be false: Bayesian PLSs are a special case of non-stationary affine PIMs. In addition, we prove that any realistic affine PIM is calibrated. These results motivate a focus on (non-stationary) affine PIMs, but their practical adoption has been limited by the significant manual effort required to implement them. To address this, we introduce affine tracing, an algorithmic framework that automatically constructs a PIM from a standard implementation of an affine iterative method by passing symbolic tracers through the computation to build an affine computational graph. We show how this graph can be transformed to compute posterior covariances, and how equality saturation can be used to perform algebraic simplifications required for computation under specific prior choices. We demonstrate the framework by automatically generating a probabilistic multigrid solver and evaluate its performance in the context of Gaussian process approximation.
☆ EnergyLens: Interpretable Closed-Form Energy Models for Multimodal LLM Inference Serving
As large language models span dense, mixture-of-experts, and state-space architectures and are deployed on heterogeneous accelerators under increasingly diverse multimodal workloads, optimising inference energy has become as critical as optimizing latency and throughput. Existing approaches either treat latency as an energy proxy or rely on data-hungry black-box surrogates. Both fail under varying parallelism strategies: latency and energy optima diverge in over 20% of configurations we tested, and black-box surrogates require hundreds of profiling samples to generalize across model families and hardware. We present EnergyLens, which uses symbolic regression as a structure-discovery tool over profiling data to derive a single twelve-parameter closed-form energy model expressed in terms of system properties such as degree of parallelism, batch size, and sequence length. Unlike black-box surrogates, EnergyLens decouples tensor and pipeline parallelism contributions and separates prefill from decode energy, making its predictions physically interpretable and actionable. Fitted from as few as 50 profiling measurements, EnergyLens achieves 88.2% Top-1 configuration selection accuracy across many evaluation scenarios compared to 60.9% for the closest prior analytical baseline, matches the predictive accuracy of ensemble ML methods with 10x fewer profiling samples, and extrapolates reliably to unseen batch sizes and hardware platforms without structural modification, making it a practical, interpretable tool for energy-optimal LLM deployment.
comment: 10 pages
☆ It's All Connected: Topology-Aware Structural Graph Encoding Improves Performance on Polymer Prediction
Graph Neural Networks (GNNs) have achieved strong results in molecular property prediction, but polymers present distinct challenges: labeled datasets are scarce and small (typically in the order of hundreds of polymers) due to the need for expensive experimentation, and complex polymer chain distributions influence polymer properties. Established practice in polymer prediction represents polymers solely by graphs of their repeat units, discarding the chain-scale morphology that governs key properties such as the glass transition temperature ($T_g$). In this work, we propose a principled graph construction that addresses this gap. Given a polymer's molecular mass distribution (MMD), we sample representative chains from the Schulz-Zimm distribution and construct representative sets of large graphs encoding chain-scale topology directly, with atoms and bonds featurized using rich chemical descriptors. We further pretrain GNN encoders via masked graph modeling on 100,000 unlabeled PSMILES strings before fine-tuning on labeled data. On a dataset of 381 polymers (180 homopolymers and 201 copolymers), we show that graph construction and self-supervised pretraining are jointly necessary: without pretraining, the large graph method matches the repeat-unit baseline (28.40 K vs. 28.36 K RMSE); with pretraining, it achieves 24.76 K +/- 3.30 K, a 5.1% reduction in mean error over the pretrained repeat-unit baseline (26.08 K +/- 4.20 K, p < 0.001, 30 runs). An ablation removing chemical features degrades performance to 36.65 K, confirming both components are essential. Results are architecture-agnostic, holding for both GINE and GATv2 encoders.
comment: 9 pages, 4 figures
☆ PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay
Electronic design automation (EDA) addresses placement, routing, timing analysis, and power-integrity verification for integrated circuits. Learning methods -- attention (Transformer) and reinforcement learning (RL) -- have recently emerged on EDA tasks, yet face two common bottlenecks: vanilla attention's quadratic complexity limits scaling, and data-scarce models overfit statistical noise and amplify weak long-range correlations against the underlying physics. We observe that EDA tasks share a physical prior -- pairwise electrical and routing interactions decay exponentially along Manhattan distance -- and integrate it as a unified inductive bias into both architecture and training. We propose PhysEDA, comprising two components Physics-Structured Linear Attention (PSLA) folds the separable Manhattan decay into the linear-attention kernel as a multiplicative bias, reducing complexity from quadratic to linear; Potential-Based Reward Shaping (PBRS) constructs a physical potential from the same kernel, providing dense reward signal under sparse RL while preserving the optimal policy via the policy-invariance theorem. Across three EDA scenarios -- decoupling-capacitor placement, macro placement, and IR-drop prediction -- PhysEDA improves zero-shot cross-scale transfer by 56.8% and achieves 14x inference speedup with 98.5% memory savings on 100x100 grids; PBRS adds another 10.8% in sparse-reward DPP.
comment: 9 pages, 4 figures, plus appendix. Code and data to be released upon publication
☆ Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths - at their respective best conditions, visual scaling unlocks a 28 % performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: https://github.com/raphajaner/procgen-hd.
☆ Bridging Sequence and Graph Structure for Epigenetic Age Prediction
Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence--graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site's methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8\% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at https://github.com/yaoli2022/graphage-seq.
☆ HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds
Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as \textit{feature density conflict}. We introduce the \textbf{Hybrid Hierarchical SAE (HH-SAE)} to resolve this by factorizing manifolds into a nested hierarchy of \textbf{Contextual} ($L_0$), \textbf{Atomic} ($f_1$), and \textbf{Compository} ($f_2$) tiers. Evaluating across disparate manifolds, HH-SAE demonstrates superior resolution by \textbf{``fracturing'' administrative clinical labels into physiological modes} and achieving a peak \textbf{cross-domain zero-shot AUC of 0.9156 in fraud detection}. Path ablation confirms the architecture's structural necessity, revealing a 13.46\% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9\% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.
☆ ConfoundingSHAP: Quantifying confounding strength in causal inference
In causal inference, confounders are variables that influence both treatment decisions and outcomes. However, unlike as in randomized clinical trials, the treatment assignment mechanism in observational studies is not known, and it is thus unclear which covariates act as confounders. Here, we aim to generate insight for causal inference and answer: which of the observed covariates act as confounders? We introduce ConfoundingSHAP, a Shapley-based method for attributing confounding strength to individual covariates. Our contributions are twofold. First, we propose a Shapley game targeted to infer the confounding strength of the covariates. Our resulting Shapley values differ from the standard applications of SHAP explanations on causal targets, such as understanding treatment effect heterogeneity, which are ill-suited for our task. Second, as our task requires evaluating the value function over many adjustment sets, we provide a scalable TabPFN-based estimation that avoids exhaustive refitting. We demonstrate the practical value across various datasets, where ConfoundingSHAP provides informative explanations of which observed covariates drive confounding and thereby helps to provide more insight for causal inference in practice.
☆ PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs
Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. Yet existing continual graph learning has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset:huggingface.co/datasets/yradwan147/PrimeKGCL|Code:github.com/yradwan147/primekg-cl-neurips2026
☆ CMKL: Modality-Aware Continual Learning for Evolving Biomedical Knowledge Graphs
Biomedical knowledge graphs are increasingly large, dynamic, and multimodal, driven by rapid advances in biotechnology such as high-throughput sequencing. Machine learning models can infer previously unobserved biomedical relationships and characterize biomedical entities in these graphs, but existing knowledge graph embedding methods and their continual learning extensions either assume static graph structure or fail to exploit multimodal information under evolving data distributions. They also apply uniform regularization across all model parameters, ignoring that different modalities may exhibit distinct forgetting dynamics as the graph evolves. We propose the Continual Multimodal Knowledge Graph Learner (CMKL), a CL framework for biomedical KGs that natively encodes structure, text, and molecules, fuses them through a Mixture-of-Experts (MoE) router, and protects previously learned knowledge with standard EWC regularization and a K-means-diverse multimodal replay buffer. We evaluate CMKL on a 129K-entity biomedical continual benchmark with 10 tasks. On continual biomedical entity classification, CMKL reaches AP 0.591 versus 0.370 for the strongest structural baseline, a 60% gain that is driven by access to multimodal features and preserved across the sequence with near-zero forgetting (AF 0.008). On continual relationship prediction, CMKL reaches AP $0.062$, matching Naive Sequential and EWC (0.058) within seed noise and outperforming Joint Training (0.047, p=0.045) and LKGE (0.039). A frozen-text ablation reaches AP 0.136, more than double any jointly trained model, yet that signal is unreachable by margin-ranking gradients: the greedy-modality asymmetry lives at the representation level, not the fusion level, and MoE routing manages it by suppressing the unreachable modality without forcing it through a learned bottleneck. Code: github.com/yradwan147/cmkl-neurips2026
☆ Priority-Driven Control and Communication in Decentralized Multi-Agent Systems via Reinforcement Learning
Event-triggered control provides a mechanism for avoiding excessive use of constrained communication bandwidth in networked multi-agent systems. However, most existing methods rely on accurate system models, which may be unavailable in practice. In this work, we propose a model-free, priority-driven reinforcement learning algorithm that learns communication priorities and control policies jointly from data in decentralized multi-agent systems. By learning communication priorities, we circumvent the hybrid action space typical in event-triggered control with binary communication decisions. We evaluate our algorithm on benchmark tasks and demonstrate that it outperforms the baseline method.
comment: Accepted to the 23rd IFAC World Congress
☆ Regret Minimization in Bilateral Trade With Perturbed Markets
We address the problem of maximizing Gain from Trade (GFT) in repeated buyer-seller exchanges subject to global budget balance constraints. While this problem is well-understood in purely adversarial and stochastic settings, these environments exhibit a sharp dichotomy: adversarial environments allow for no-regret learning against the best fixed-price mechanism, whereas stochastic environments allow for no-regret learning against the best distribution over prices that is budget balanced in expectation. This gap is significant, as policies balanced in expectation can increase the GFT by a multiplicative factor of two. In this work, we bridge these extremes by studying perturbed markets, where an underlying stochastic distribution is subject to an adversarial corruption $C$. We design an algorithm that adaptively scales with the level of corruption, achieving an $\tilde{\mathcal{O}}(T^{3/4}) + \mathcal{O}(C\log(T))$ regret bound against the best budget-balanced distribution over prices. Simultaneously, our algorithm maintains the worst-case $\tilde{\mathcal{O}}(T^{3/4})$ regret bound relative to a per-round budget-balanced baseline, ensuring optimality even in fully adversarial environments.
☆ Formally Verifying Analog Neural Networks Under Process Variations Using Polynomial Zonotopes
Analog neural networks are gaining attention due to their efficiency in terms of power consumption and processing speed. However, since analog neural networks are implemented as physical circuits, they are highly sensitive to manufacturing process variations, which can cause large deviations from the nominal model. We present a polynomial-based model that resembles the performance of the neuron circuit under process variations. Then, we formally verify the behavior of the circuit-level model using reachability analysis with polynomial zonotopes, thus, avoiding conventional, time-consuming Monte Carlo simulations. We evaluate our proposed verification approach on three different datasets, verifying both fully-connected and convolutional analog neural networks. Our experimental results confirm the effectiveness of our verification approach by reducing the verification time from days to seconds while enclosing 99% of the variation samples.
☆ Can Muon Fine-tune Adam-Pretrained Models?
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
☆ Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this ``summarisation and forgetting'' can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $Θ_VΣΘ_K^{\top}Θ_Q x_t$, where $Σ$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
☆ QT-Net: Rethinking Evaluation of AI Models in Atomic Chemical Space
Atomic properties such as partial charges or multipoles encode chemically meaningful information that can inform downstream molecular property prediction, but their evaluation as machine learning targets has been complicated by the absence of a principled out-of-distribution evaluation protocol at the atomic level. In this work, we propose a held-out evaluation protocol that clusters atomic environments by SOAP descriptors and computes metrics accounting only for cluster labels unseen during training. Following this procedure, we use 5$\times$5 cross-validation and Tukey's HSD to run a statistically rigorous comparison of E(3)-equivariant against non-equivariant, rotationally augmented models for predicting electron populations and multipoles of H, C, N, and O atoms. Building on our results, we introduce the Quantum Topological Neural Network (QT-Net), a rotationally augmented, non-equivariant graph neural network. We show that QT-Net can be used to infer properties of atoms in molecules from QM9 outside our training set, and that these inferred properties can yield improvement when used as input features for downstream molecular property prediction. To further validate the framework, molecular dipole moments computed from QT-Net's per-atom outputs recover the ground-truth values reported in QM9. We release all code and data, including a JAX implementation of QT-Net, to support the broader use of learned QTA properties as inductive biases for atomic-scale molecular machine learning.
☆ AxiomOcean: Forecasting the Three-Dimensional Structure of the Upper Ocean
Short-term ocean forecast skill depends strongly on the three-dimensional ocean structure of the upper ocean, which governs stratification, subsurface heat storage, and the response of the ocean to atmospheric forcing. However, AI ocean forecasting models often fail to preserve this vertical structure, resulting in over-smoothed subsurface features and weak physical consistency under strong forcing. Here, we present AxiomOcean, a global AI ocean forecasting model that explicitly represents vertical hierarchy and cross-layer dependence within the water column. By combining a fully three-dimensional encoder-backbone-decoder architecture with surface atmospheric forcing, AxiomOcean jointly predicts upper-ocean temperature, salinity, and three-dimensional currents at global 1/12° resolution down to 643 m depth. In 10-day forecasts, AxiomOcean outperforms an advanced AI comparison model across variables and lead times, reducing day-1 RMSE by approximately 20 to 35% while maintaining higher anomaly correlation. The gain is not achieved through excessive smoothing: AxiomOcean better preserves eddy kinetic energy, temperature and salinity variance. Its advantage also extends through the water column and remains evident across the equatorial Pacific, Kuroshio Extension, and Southern Ocean, yielding a more realistic reconstruction of upper-ocean heat content. These results show that explicitly preserving upper-ocean three-dimensional structure can improve both forecast accuracy and physical fidelity in AI ocean prediction.
☆ SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still performs projection to a large vocabulary, becoming one of the major computational bottlenecks. In prior work this issue has been predominantly addressed via static or dynamic vocabulary truncation. Yet mitigating the bottleneck, these methods bring in extra complexity, such as special vocabulary curation, sophisticated inference-time logic or modifications of the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves $4\text{-}5\times$ acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to $8\text{-}9\%$ of the end-to-end speedup. Our method requires minimal adjustments of training and inference pipelines. Combined with the aforementioned speedup improvements, it makes SlimSpec a strong alternative across wide variety of draft LM-head architectures.
☆ Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs
Spectral neural operators achieve strong performance for PDE learning, but rely on fixed global bases that limit their ability to represent spatially heterogeneous and multiscale dynamics. We propose Adaptive Basis Learning (ABLE), a framework that learns data-dependent spectral representations instead of relying on predefined bases. ABLE constructs a spatially adaptive Parseval frame via a learned ancillary density, enabling the operator to act in a lifted spectral space while preserving invertibility and maintaining $O(N\log N)$ complexity through FFT-based implementation. This shifts the source of expressivity from spectral coefficients to the representation itself, allowing the model to capture localized structures and non-translation-invariant interactions more efficiently. ABLE integrates seamlessly into existing neural operator architectures as a drop-in replacement for spectral layers. Across a range of benchmarks ABLE improves accuracy over strong baselines, with the largest gains in regimes characterized by sharp gradients and multiscale behavior. Moreover, augmenting existing models (e.g., U-FNO, HPM) with ABLE further enhances their performance, demonstrating its role as a general and complementary spectral refinement. Our results highlight that the data-driven choice of representation, rather than operator complexity alone, is a key bottleneck in neural operator design. By learning the basis itself, ABLE provides a principled and efficient framework for improving spectral methods in PDE learning.
comment: 26 pages, 4 figures
☆ Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure
Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.
☆ DRIFT: Drift-Resilient Invariant-Feature Transformer for DGA Detection DSN 2026
Domain Generation Algorithms (DGAs) evolve continuously to evade botnet detection, posing a persistent challenge for dependable network defense. While deep learning-based detectors achieve strong performance under static conditions, they suffer severe degradation when facing temporal drift. Through a 9-year longitudinal study (2017-2025), we empirically show that state-of-the-art character- and word-based DGA classifiers rapidly lose effectiveness as new DGA variants emerge. To address this problem, we propose a drift-resilient Transformer-based framework that learns invariant representations through a hybrid tokenization strategy and multi-task self-supervised pre-training. The model integrates (i) character-level encoding to capture stochastic morphological patterns and (ii) subword-level encoding for word-based DGAs. Three pre-training tasks enable the model to learn robust structural and contextual features prior to supervised fine-tuning. Comprehensive evaluations demonstrate that our method significantly mitigates temporal degradation and consistently outperforms state-of-the-art baselines in forward-chaining experiments. The proposed approach offers a dependable foundation for long-term DGA defense in evolving threat landscapes. Our code is available at: https://github.com/snsec-net/2026-DSN-DRIFT.
comment: 14 pages, 7 figures, 8 tables. Accepted to appear in Proc. of the 56th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2026)
☆ Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but it would benefit from incorporating observable metrics and real-data validation.
☆ Remember to Forget: Gated Adaptive Positional Encoding
Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.
☆ Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models
Large language models can score well on named game-theory benchmarks while failing on the same strategic computation once semantic cues are removed. We show this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous $2{\times}2$, $3{\times}3$, and $5{\times}5$ payoff matrices. The benchmark separates semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. Training only on $2{\times}2$ and $3{\times}3$ games, supervised fine-tuning raises unseen $5{\times}5$--$7{\times}7$ success from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is $2$-Lipschitz in payoff perturbations, unlike discontinuous vertex-returning LP equilibrium selectors, explaining why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve $3{\times}3$ games embedded in much larger matrices, while random-padded controls fail and dense $12{\times}12$ games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.
☆ Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold $τ$, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K=(V-K)\exp(τ)/(Z_A+(V-K)\exp(τ))$, where $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
☆ Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
☆ Causal Explanations from the Geometric Properties of ReLU Neural Networks
Neural networks have proved an effective means of learning control policies for autonomous systems, but these learned policies are difficult to understand due to the black-box nature of neural networks. This lack of interpretability makes safety assurance for such autonomous systems challenging. The fields of eXplainable Artificial Intelligence (XAI) and eXplainable Reinforcement Learning (XRL) aim to interpret the decision making processes of neural networks and autonomous agents, respectively. In particular, work on causal explanations aims to provide "why" and "why not" explanations for why a model made a given decision. However, most of the work on explainability to date utilises a distilled version of the original model. While this distilled policy is interpretable, it necessarily degrades in performance significantly when compared to the original model, and is not guaranteed to be an accurate reflection of the decision making processes in the original model and as such cannot be used to guarantee its safety. Recent work on understanding the geometry of ReLU neural networks shows that a ReLU network corresponds to a piecewise linear function divided into regions defined by an n-dimensional convex polytope. Through this lens, a neural network can be understood as dividing the input space into distinct regions which apply a single linear function for each output neuron. We show that this geometric representation can be used to generate causal explanations for the network's behaviour similar to previous work, but which extracts rules directly from the geometry of Neural Networks with the ReLU activation function, and is therefore an accurate reflection of the network's behaviour.
comment: 7 pages, 0 figures, Accepted for presentation at the Yorkshire Innovation in Science and Engineering Conference
☆ Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $ n^{1/(2β)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $β>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=Θ(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
☆ The Polynomial Counting Capabilities of Message Passing Neural Networks
The counting power of Message Passing Neural Networks (MPNN) has been the subject of many recent papers, showing that they can express logic that involves counting up to a threshold or more generally satisfy a linear arithmetic constraint. In this paper, we study the counting capabilities of MPNN beyond linear arithmetic, primarily utilising local and global mean aggregations. In particular, our goal is to tease out conditions required to express extensions of graded modal logic with polynomial counting constraints. We show that global polynomial counting constraints in node-labelled graphs can be checked using mean MPNN under mild assumptions. Checking local constraints is also possible, if we consider formulas with no nested modalities and additionally either (i) permit sum/max aggregations, or (ii) only restrict to regular graphs. We also show how formulas with nested modalities can be captured by mean MPNN over graphs with tree-like structures and similar assumptions.
☆ Regret Analysis of Guided Diffusion for Black-Box Optimization over Structured Inputs
Guided-diffusion black-box optimization (BO) has shown strong empirical performance on structured design problems such as molecules and crystals, but its regret behavior remains poorly understood. Existing BO regret analyses typically rely on maximum information gain, non-pretrained surrogate models, or exact acquisition maximization -- assumptions that break down in modern diffusion -- BO pipelines, where pretrained diffusion models serve as powerful priors over valid structures and acquisition maximization is replaced by approximate sampling over astronomically large discrete spaces. We develop a first certificate-based expected simple-regret framework for guided-diffusion BO that avoids maximum-information-gain bounds, RKHS assumptions, and exact acquisition maximization. The central quantity in our analysis is mass lift: the increase in probability mass assigned to near-optimal designs relative to the pretrained generator. This view explains how exponential-looking finite-budget convergence and polynomial acceleration can all arise from the same mechanism. We also give practical diagnostics for estimating search exponents from finite candidate pools and a proposal-corrected resampling construction that provides a fully certified sampler instance.
comment: 48 pages, 12 figures
☆ Multifidelity Gaussian process regression for solving nonlinear partial differential equations
Solving nonlinear partial differential equations (PDEs) using kernel methods offers a compelling alternative to traditional numerical solvers. However, the performance of these methods strongly depends on the choice of kernel. In this work, as the available information is inherently multifidelity, we propose a kernel learning approach based on cokriging, leveraging empirical information from multifidelity simulations. In the first step, we fit a differentiable non-stationary kernel to an empirical kernel obtained from low-fidelity simulations. In the second step, we derive a high-fidelity kernel with estimated hyperparameters, and construct a corresponding high-fidelity mean using the multifidelity framework. These components can then be used within a Gaussian process framework for solving PDEs. Finally, we demonstrate the performance of the proposed physics-informed method on the Burgers' equation.
comment: 31 pages, 20 figures
☆ PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation
Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.
☆ DeepLévy: Learning Heavy-Tailed Uncertainty in Highly Volatile Time Series
Modeling uncertainty in heavy-tailed time series remains a critical challenge for deep probabilistic forecasting models, which often struggle to capture abrupt, extreme events. While Lévy stable distributions offer a natural framework for modeling such non-Gaussian behaviors, the intractability of their probability density functions severely limits conventional likelihood-based inference. To address this, we introduce DeepLévy, a neural framework that learns mixtures of Lévy stable distributions by minimizing the discrepancy between empirical and parametric characteristic functions. DeepLévy incorporates a mixture mechanism that adaptively learns context-dependent weights and parameters over multiple Lévy components, enabling flexible multi-horizon uncertainty modeling. Evaluations on both real and synthetic datasets demonstrate that DeepLévy outperforms state-of-the-art deep probabilistic forecasting approaches in tail risk metrics, especially under extreme volatility.
☆ Foundations of Reliable Inference: Reliability-Efficiency Co-Design
Reliable inference requires that artificial intelligence (AI) models provide trustworthy uncertainty estimates, not merely accurate predictions. Recent advances in Bayesian learning have made significant progress toward this goal, and growing concerns about computational overhead have jointly shifted the design criterion from reliability alone to the co-design of reliability and efficiency, i.e., reducing computational overhead while preserving trustworthy uncertainty quantification. This thesis develops a unified framework from two perspectives to address the central question: can we efficiently perform reliable inference?
comment: PhD Thesis
☆ Portable Active Learning for Object Detection CVPR 2026
Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
comment: CVPR 2026(highlight)
☆ PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent
Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive \texttt{int8} quantization, PowerStep remains numerically stable and reduces optimizer memory by $\sim\!8\times$ compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
☆ Fast Training of Mixture-of-Experts for Time Series Forecasting via Expert Loss Integration
We propose a novel adaptive Mixture-of-Experts (MoE) framework for time series forecasting that enhances expert specialization by incorporating expert-specific loss information directly into the training process. Notably, the overall objective comprises the base forecasting loss and expert-specific losses, allowing expert-level prediction errors to jointly shape training alongside the global forecasting loss. This framework is further combined with a partial online learning strategy, enabling incremental updates of both the gating mechanism and expert parameters. This approach significantly reduces computational cost by eliminating the need for repeated full model retraining. By integrating expert-level loss awareness with efficient online optimization, the proposed method achieves improved learning efficiency while maintaining strong predictive performance. Empirical results across economic, tourism, and energy datasets with varying frequencies demonstrate that the proposed approach generally outperforms both statistical methods and state-of-the-art neural network models, such as Transformers and WaveNet, in forecasting accuracy and computational efficiency. Furthermore, ablation studies confirm the effectiveness of the expert-specific loss integration strategy, highlighting its contribution to enhancing predictive performance.
☆ Relations Are Channels: Knowledge Graph Embedding via Kraus Decompositions
Knowledge graph embedding (KGE) models typically represent each relation as an operator on entity embeddings. In this work, we identify three structural axioms that any principled relation operator must satisfy, linearity, trace preservation, and complete positivity, and show that they characterize a Kraus channel structure via the Kraus representation theorem. The completeness constraint defining this family is equivalent to these axioms, providing a principled foundation rather than an externally imposed condition. Under this formulation, most existing operator-based KGE models are recoverable as special cases with Kraus rank $κ= 1$ under specific embedding choices. We further generalize this characterization to arbitrary metric geometries by introducing \mbox{w-Kraus} channels, which satisfy completeness by construction within their respective spaces. Building on this theory, we propose \textsc{KrausKGE}, a principled KGE model that naturally handles $1$-to-$N$ and $N$-to-$N$ relations, supports $k$-hop reasoning without requiring explicit path encoders, and eliminates the need for norm constraints on entity embeddings. Additionally, our framework yields the first theoretically grounded per-relation complexity measure in the KGE literature, with a provable lower bound in terms of the empirical relation matrix rank. Empirical evaluation demonstrates that \textsc{KrausKGE} consistently outperforms strong baselines on $N$-to-$N$ relations, with performance gains that increase monotonically with relation fan-out, in alignment with theoretical predictions.
☆ Active Tabular Augmentation via Policy-Guided Diffusion Inpainting ICML 2026
Generative tabular augmentation is appealing in data-scarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize a fidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner's held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.
comment: Accepted for publication at ICML 2026
☆ Signature Approach for Contextual Bandits with Nonlinear and Path-dependent Rewards
We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signature space. This representation enables the use of efficient linear contextual bandit methods while preserving expressive sequential structure. Building on this framework, we propose \texttt{DisSigUCB}, a signature-based disjoint upper confidence bound (UCB) algorithm. Under boundedness and non-degeneracy assumptions, we prove a high-probability data-dependent sublinear regret bound of order \(\tilde{\mathcal O}(\sqrt{(d+m)KT})\) where \(d\) is the context dimension and \(m\) is the signature feature dimension. Synthetic experiments and numerical applications on temperature sensor monitoring, sleep-stage classification, and hospital nurse staffing demonstrate that \texttt{DisSigUCB} consistently outperforms classical linear and kernelized contextual bandit baselines in nonlinear and path-dependent settings.
☆ Follow the Mean: Reference-Guided Flow Matching
Existing approaches to controllable generation typically rely on fine-tuning, auxiliary networks, or test-time search. We show that flow matching admits a different control interface: adaptation through examples. For deterministic interpolants, the velocity field is solely governed by a conditional endpoint mean; shifting this mean shifts the flow itself. This yields a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows. We instantiate this idea in two forms. Reference-Mean Guidance is training-free: it computes a closed-form endpoint-mean correction from a reference bank and applies it to a frozen FLUX.2-klein (4B) model, enabling control of color, identity, style, and structure while keeping the prompt, seed, and weights fixed. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and learned residual refiner, matching unconditional DiT-B/4 quality on AFHQv2 while allowing the reference set to be swapped at inference time. These results point to a broader direction: generative models that adapt through data, not parameter updates.
☆ Nearly-Optimal Algorithm for Adversarial Kernelized Bandits
This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm achieves $\tilde{O}(\sqrt{T γ_T})$ adversarial regret, where $T$ and $γ_T$ denote the number of total rounds and the maximum information gain, respectively. For squared exponential (SE) and $ν$-Matérn kernels, we also show algorithm-independent lower bounds that guarantee the optimality of our algorithm up to polylogarithmic factors. Furthermore, we present a computationally efficient variant of our algorithm using Nyström approximation while maintaining nearly optimal regret guarantees.
comment: 47 pages
☆ Set Prediction for Next-Day Active Fire Forecasting
Accurate next-day active fire forecasts can support early warning, disaster response, forest risk assessment, and downstream estimation of fire-related carbon emissions. Existing machine learning approaches to wildfire forecasting typically predict wildfire danger or fire probability on kilometre-scale daily grids, which is useful for regional warning but does not directly represent localized fire events. We propose Wildfire Ignition Set Predictor (WISP), a query-based model that reformulates next-day active fire forecasting as point-set prediction. From 48 hours of covariates including meteorology, satellite vegetation products, static land, and fire history, WISP predicts a fixed-size ranked set of future active fire cluster centres on a 375 m grid across globally distributed regions. The model is trained end-to-end with Hungarian matching; to address the conflicting roles of the classification score in assignment, ranking, and query activation, we use asymmetric classification-localization weighting in matching and loss. We further construct a globally distributed, hourly, multi-source benchmark for this task. On a held-out test set spanning fire regions worldwide, the best WISP variant achieves 38.2% average precision (AP) for ranked fire-centre detections, covers 53.4% of fire cluster mass weighted by fire radiative power (FRP), and localizes 54.1% of observed clusters within 5 km. These results establish sparse set prediction as a viable formulation for high-resolution wildfire forecasting and provide a benchmark for future work in this regime.
☆ Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
We participated in the Fifth UNLP shared task on multi-domain document understanding, where systems must answer Ukrainian multiple-choice questions from PDF collections and localize the supporting document and page. We propose a retrieval-augmented pipeline built around three ideas: contextual chunking of PDFs, question-aware dense retrieval and reranking conditioned on both the question and answer options, and constrained answer generation from a small set of reranked passages. Our final system uses Qwen3-Embedding-8B for retrieval, a fine-tuned Qwen3-Reranker-8B for passage ranking, and Qwen3-32B for answer selection. On a held-out split, reranking improves Recall@1 from 0.6957 to 0.7935, while using the top-2 reranked passages raises answer accuracy from 0.9348 to 0.9674. Our best leaderboard run reached 0.9452 on the public leaderboard and 0.9598 on the private leaderboard. Our results suggest that, under strict code-competition constraints, preserving document structure and making relevance estimation aware of the answer space are more effective than adding complex downstream heuristics.
comment: Accepted to The Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
☆ Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
☆ LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
Time series forecasting serves as an essential tool for many real-world applications, supporting tasks such as resource optimization and decision-making. Despite significant architectural advancements, most modern models still treat forecasting task as a fixed mapping from history to target horizons. This induces temporal decoupling across future time points and limits the model's ability to adapt to the evolving context as forecasting progresses. In this work, we present LeapTS, a novel framework that reformulates time series forecasting as a dynamic scheduling process over the prediction horizon. Specifically, LeapTS organizes the forecasting process into multi-level decisions using: (1) the hierarchical controller to dynamically select the optimal prediction scale and advancement length at each step, and (2) continuous-time state evolution driven by neural controlled differential equations. Within this process, the controlled update mechanism explicitly couples the irregular temporal dynamics with discrete scheduling feedback. Extensive evaluations on both real-world and synthetic datasets demonstrate that LeapTS improves overall forecasting performance by at least 7.4% while achieving a 2.6$\times$ to 5.3$\times$ inference speedup over representative Transformer-based models. Furthermore, by explicitly tracing the scheduling trajectories, we reveal how the model autonomously adapts its forecasting behavior to capture non-stationary dynamics.
☆ Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation
This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.
☆ Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift
Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) represents another canonical class of bandit algorithms, well known for its strong empirical performance and naturally suited to offline-to-online learning through its Bayesian formulation. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means. This makes indices constructed from purely online and hybrid data difficult to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. The median anchoring systematically corrects bias induced by distribution shift by mitigating over-estimation for suboptimal arms and under-estimation for optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and quantifying how the degree of distribution shift and the size of offline data affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
☆ BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the $\mathcal O(\varepsilon^{-2})$ sample complexity of MA-SOBA for finding an $\varepsilon$-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.
☆ Scalable Gaussian process inference via neural feature maps
We present a theoretically grounded Gaussian process framework that leverages neural feature maps to construct expressive kernels. We show that the learned feature map can be interpreted as an optimal low-rank approximation to a Gram matrix derived from an implied RKHS, from which we establish consistency of the GP posterior. We further analyse the spectral properties of the induced kernels and introduce product feature-map kernels to address oversmoothing. This simple yet powerful approach enables fast, scalable, and accurate exact GP inference with minimal upfront work. The flexibility of kernel design supports seamless application to both regression and classification tasks across diverse data modalities, including tabular inputs and structured domains such as images. On benchmark datasets, this approach surpasses pre-existing methods in terms of accuracy and training and prediction efficiency.
comment: 27 pages
☆ DeepLog: A Software Framework for Modular Neurosymbolic AI IJCAI2026
DeepLog is an operational neurosymbolic framework that unifies logic and deep learning within standard PyTorch workflows. While existing neurosymbolic systems focus on a particular paradigm and semantics, DeepLog serves as a universal backend that can emulate many systems in the neurosymbolic alphabet soup. By treating diverse neurosymbolic languages as high-level specifications, the DeepLog software automatically compiles them into optimized arithmetic circuits. This design lowers the barrier for machine learning practitioners by treating logic as composable modules, while providing neurosymbolic developers with a shared, high-performance basis for prototyping new integration strategies. The code is available here: https://github.com/ML-KULeuven/deeplog
comment: Preprint accepted at IJCAI2026 Demo Track
☆ Predictive Radiomics for Evaluation of Cancer Immune SignaturE in Glioblastoma: the PRECISE-GBM study
Background: Radiogenomics allows identification of radiological biomarkers for genomic phenotypes. In glioblastoma, these biomarkers could potentially complement patient stratification strategies. We aim to develop and analytically validate radiological biomarkers that capture immune cell signatures within IDH-wildtype glioblastoma microenvironment using radiogenomic analysis. Methods: This was a retrospective multicenter study using curated open-access anonymized imaging and genomic data from TCGA-GBM, CPTAC, IvyGAP, REMBRANDT and CGGA datasets. Imaging data consisted of MRI-based radiomic features extracted from necrotic core, enhancing and edema regions of deep learning-based auto-segmented tumors. Radiomic feature selections were performed using nested cross-validated LASSO. Support vector machine and ensemble models were trained using seventeen immune and cell-specific score labels extracted from deconvoluted transcriptomic data using pan-cancer and glioblastoma immune signature matrices as reference standards. Seventeen classifier models trained in three cross-cohort strategies were validated on three held-out datasets assessing stability and generalizability. Results: One-hundred-and-seventy-six patients were included in the study. The immune-related radiomic signatures obtained after feature selection were shape, first order and higher order radiomic features. Models predicting macrophage subtype immune signature showed stable mean performance on balanced accuracy (0.67) and precision (0.89) metrics for three independent holdout datasets with ensemble model outperforming support vector machine model. Conclusion: Radiogenomic models non-invasively predicted the macrophage subtype M0 immune signature in IDH-wildtype glioblastoma. These biomarkers have the potential to stratify patients for immunotherapy within prospective glioblastoma clinical trials.
comment: Abstract : 226; Importance of study: 109; Manuscript: 5690 (excluding references) Figures: 4, Tables: 2 Supplemental File: 1
☆ Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs
Operator learning for partial differential equations (PDEs) aims to learn solution operators on infinite-dimensional function spaces from finite-resolution data. In this setting, it is important for the learned model to be discretization-invariant, or resolution-robust, and to reflect PDE-specific structure. It is therefore natural to ask how such structure should be encoded in the model architecture, hypothesis class, or learning procedure. In this paper, we study operator learning for solution operators of nonlinear parabolic PDEs based on Duhamel--Picard iteration. We formulate Picard iteration as an abstract state-transition model and present a theoretical framework for Picard-type operator learning. We derive implementation-agnostic generalization error bounds that separate the implementation error from the estimation error associated with the abstract state-transition model induced by Picard iteration. A key consequence is that increasing the Picard depth reduces the Picard truncation error without causing an unbounded growth of the entropy-based estimation error. We also extend the analysis to long-time prediction by rolling out the same learned local model over successive time blocks. Finally, we illustrate the theory for nonlinear heat equations on the torus using a Picard-type Fourier neural operator as a concrete implementation.
comment: 39 pages
☆ DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models ICASSP 2026
Federated learning (FL) enables the collaborative training of large-scale language models (LLMs) across edge devices while keeping user data on-device. However, FL still exposes sensitive information through client-provided gradients. Differentially private stochastic gradient descent (DP-SGD) mitigates this risk by clipping each client's contribution to a threshold $C$ and adding noise proportional to $C$. Existing adaptive clipping techniques dynamically adjust $C$ but demand tedious hyperparameter tuning, which can erode the privacy budget. In this paper, we introduce DP-LAC, a method that first estimates an initial clipping threshold within an order of magnitude of the optimum using private histogram estimation, and then adapts this threshold during training without consuming additional privacy budget or introducing new hyperparameters. Empirical results show that DP-LAC outperforms both state-of-the-art adaptive clipping methods and vanilla DP-SGD, achieving an average accuracy gain of $6.6\%$.
comment: Accepted at ICASSP 2026
☆ E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability
TCAV (Testing with Concept Activation Vectors) is an interpretability method that assesses the alignment between the internal representations of a trained neural network and human-understandable, high-level concepts. Though effective, TCAV suffers from significant computational overhead, inter-layer disagreement of TCAV scores, and statistical instability. This work takes a step toward addressing these challenges by introducing E-TCAV, a framework for efficient approximation of TCAV scores, which is based on extensive investigation into three key aspects of the TCAV methodology: 1) the effect of latent classifiers on the stability of TCAV scores, 2) the inter-layer agreement of TCAV scores, and 3) the use of the penultimate layer as a fast proxy for earlier layers for TCAV computation. To ensure a solid foundation for E-TCAV, we conduct extensive evaluations across four different architectures and five datasets, encompassing problems from both computer vision and natural language domains. Our results show that the layers in the final block of the neural network strongly agree with the penultimate layer in terms of the TCAV scores, and the commonly observed variance of the TCAV scores can be attributed to the choice of the latent classifier. Leveraging this inter-layer agreement and the degeneracy of directional sensitivities at the penultimate layer, E-TCAV guarantees linearly scaling speed-ups with respect to the network's size and the number of evaluation samples, marking a step towards efficient model debugging and real-time concept-guided training.
☆ Teaching LLMs to See Graphs: Unifying Text and Structural Reasoning
Using Large Language Models (LLMs) to process graph-structured data is an active research area, yet current state-of-the-art approaches typically rely on multi-step pipelines with Graph Neural Network (GNN) encoders that compress rich textual attributes into solitary tokens, creating a significant semantic bottleneck. In this paper, we introduce the Graph Transformer Language Model (GTLM), a novel architecture that enables pretrained LLMs to natively process graph topologies while entirely eliminating this compressive bottleneck. GTLM is exceptionally parameter-efficient: by injecting graph-aware attention biases directly into the LLM's attention modules, it introduces only 0.015% additional parameters relative to the base model. We theoretically prove that our bidirectional attention prefix preserves node permutation equivariance while maintaining exact backward compatibility with the pretrained base model. Extensive evaluations demonstrate that a 1B-parameter GTLM matches or exceeds the performance of 7B-parameter state-of-the-art models on standard Text-Attributed Graph benchmarks, while significantly surpassing baselines on GraphQA. Finally, we demonstrate that GTLM attention heads implicitly learn to simulate message passing, explaining its superior performance on algorithmic tasks. This paradigm shift enables true algorithmic reasoning within LLMs and provides a scalable foundation for next-generation GraphRAG and relational deep learning.
☆ When Normality Shifts: Risk-Aware Test-Time Adaptation for Unsupervised Tabular Anomaly Detection
Unsupervised tabular anomaly detection methods typically learn feature patterns from normal samples during training and subsequently identify samples that deviate from these patterns as anomalies during testing. However, in practical scenarios, the limited scale and diversity of training data often lead to an incomplete characterization of normal patterns. While test-time adaptation offers a remedy, its isolated focus on test-time optimization ignores the critical synergy with training-phase learning. Furthermore, indiscriminate adaptation to unlabeled test data inevitably triggers anomaly contamination, preventing the model from fully realizing its discriminative capability between normal and anomalous samples. To address these issues, we propose RTTAD, a Risk-aware Test-time adaptation method for unsupervised Tabular Anomaly Detection. RTTAD holistically tackles normality shifts via a synergistic two-stage mechanism. During training, collaborative dual-task learning captures multi-level representations to establish a robust normal prior. During testing, a Test-Time Contrastive Learning (TTCL) module explicitly accounts for adaptation risk by selectively updating the model using high-confidence pseudo-normal samples while constraining anomalous ones. Additionally, TTCL incorporates a k-nearest neighbor-based contrastive objective to refine embedding distributions, thereby further enhancing the model's discriminative capacity. Extensive experiments on 15 tabular datasets demonstrate that RTTAD achieves state-of-the-art overall detection performance.
comment: 13 pages, 6 figures
☆ Building Korean linguistic resource for NLU data generation of banking app CS dialog system
Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate a Korean annotated training data for NLU in the banking customer service (CS) domain. By an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performances of DIET-only (Intent: 0.91 /Topic [entity+feature]: 0.83), DIET+ HANBERT (I:0.94/T:0.85), DIET+ KoBERT (I:0.94/T:0.86), and DIET+ KorBERT (I:0.95/T:0.84) models trained on FIAD-generated data to extract various types of semantic items.
☆ MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection
Software vulnerability detection is critical for ensuring software security and reliability. Despite recent advances in deep learning, real-world vulnerability datasets suffer from two severe challenges: frequency imbalance and difficulty imbalance. We reinterpret these challenges from an embedding geometry perspective, observing that such imbalances induce geometric distortions in hyperspherical representation space. To address this issue, we propose MARGIN, a metric-based framework that learns discriminative vulnerability representations through adaptive margin metric learning and hyperspherical prototype modeling. MARGIN dynamically adjusts geometric regularization according to the distribution structure estimated by the von Mises-Fisher concentration, aligning the probability mass of embedding distributions with their corresponding Voronoi cells, thereby reducing geometric distortion and yielding more stable decision boundaries. Extensive experiments on public vulnerability datasets show that MARGIN consistently outperforms strong baselines, achieving notable improvements in classification and detection, especially on challenging, imbalanced datasets. Further analysis demonstrates that MARGIN produces more structured embedding geometries, improving robustness, interpretability, and generalization.
comment: 12 pages.9 figures, 4 tables
☆ The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
We study how temporal correlations in the data can make certain sparse learning problems efficiently learnable by gradient-based methods. Our focus is on Boolean k-juntas, a canonical sparse learning problem known to pose barriers for gradient-based methods under independent uniform samples. We show that this picture changes when the samples are generated by a lazy random walk on the hypercube. In this setting, the temporal dependencies can be exploited by a two-layer ReLU network trained using stylized-SGD with a temporal-difference loss, which compares target and predicted increments across consecutive samples. For every fixed k, the resulting sample complexity is essentially linear in the ambient dimension d. By contrast, we show that for large-batch gradient methods using standard convex pointwise losses, temporal correlations do not provide the same advantage.
comment: 10 pages main body, 3 figures
☆ When Does Non-Uniform Replay Matter in Reinforcement Learning?
Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
☆ FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization
Molecular optimization seeks to improve a molecule through small structural edits while preserving similarity to the starting compound. Recent language-model approaches typically treat this task as prompt-conditioned sequence generation. However, relying on natural language introduces an inherent data-scaling bottleneck, often leads to chemical hallucinations, and ignores the strong context dependence of fragment effects. We present FORGE, a two-stage framework that reformulates molecular optimization as context-aware local editing. By utilizing automatically mined, verified low-to-high edit pairs instead of expensive human text annotations, Stage 1 ranks candidate fragments by their property contribution under the full molecular context to inject chemical prior, and Stage 2 generates explicit fragment replacements. Built on a compact 0.6B language model, FORGE further adapts to unseen black-box objectives through in-context demonstrations. Across Prompt-MolOpt, PMO-1k and ChemCoTBench, FORGE consistently outperforms prior methods, including substantially larger language models and graph methods. These results highlight the value of explicit fragment-level supervision as a more easily obtainable, scalable, and hallucination-less alternative to natural language training.
☆ Stellar Age Compression Reshapes Interpretations of the Milky Way Thick-Disk Formation History
The formation timescale of the Milky Way thick disk is one of the central debates in Galactic archaeology. The age-metallicity relation (AMR), formation timescale, and chemical evolution gradients are frequently used to infer a rapid assembly, short-timescale enrichment, and bursty formation history of the thick disk. However, stellar ages are not directly observable, introducing the potential risk that inferred ages may harbor a systematic compression tied to observational quality. In this paper, we use the same stellar sample and identical physical covariate matching conditions, but two independent age scales--spectroscopic inferred ages (astroNN) and asteroseismic ages (APOKASC-3)--to compare the observable signatures of the thick-disk formation history. We find that several key observables previously supporting a rapid thick-disk formation are systematically weakened under seismic anchoring: the AMR slope flattens from -3.29 to -1.86 Gyr dex-1 (Delta a = +1.43), the formation timescale widens from 3.04 to 3.55 Gyr, and the peak formation age shifts from 9.1 to 6.0 Gyr. Through transport inversion experiments, we further show that additive noise can only broaden the age distribution and cannot reproduce the above pattern, whereas a compressive transport map (lambda < 1) simultaneously reproduces a narrower age distribution, a steeper AMR, and rapid-formation-like observables. This result indicates that the compression transformation itself is sufficient to generate rapid-formation-friendly observables without requiring an intrinsically bursty formation history. Our findings reveal that statistical interpretations of the Milky Way formation history may depend sensitively on the stellar age definition itself.
☆ Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses
We study the parameterized complexity of testing approximate first-order stationarity at a prescribed point for continuous piecewise-affine (PA) functions, a basic task in nonsmooth optimization. PA functions form a canonical model for nonsmooth stationarity testing and capture the local polyhedral geometry that appears in ReLU-type training losses. Recent work by Tian and So (SODA 2025) shows that testing approximate stationarity notions for PA functions is computationally intractable in the worst case, and identifies fixed-dimensional tractability as an open direction. We address this direction from the viewpoint of parameterized complexity, with the ambient dimension $d$ as the parameter. In this paper, we give XP algorithms in fixed dimension for the tractable sides, and prove W[1]-hardness for the complementary sides. Moreover, lower bounds under the Exponential Time Hypothesis rule out algorithms running in time $ρ(d)\size^{o(d)}$ for any computable function $ρ$, where $\size$ denotes the total binary encoding length of the stationarity-testing instance. As a further consequence, our results yield the corresponding parameterized complexity picture for testing local minimality of continuous PA functions. We further extend our hardness results to a family of shallow ReLU CNN training losses, with stationarity tested in the trainable weight space. Thus, the same parameterized-complexity picture also appears for simple CNN training losses.
comment: 32 pages, 1 figure, 1 table
☆ Extended Wasserstein-GAN Approach to Causal Distribution Learning: Density-Free Estimation and Minimax Optimality
Distributional causal inference requires estimating not only average treatment effects but also interventional outcome distributions, including quantiles, tail risks, and policy-dependent uncertainty. As a method for distributional causal inference, generative adversarial network (GAN)-based counterfactual methods are flexible tools for this task. However, these methods have several limitations. First, the objectives of certain techniques do not coincide with the statistical risk of the identifiable causal target, and therefore provide limited theoretical guarantees regarding estimable counterfactual distributions or optimality. Second, they tend to rely on unstable density-based methods, such as density ratio estimation. In this paper, we propose GANICE (GAN for Interventional Conditional Estimation) with several advantages: it (i) clarifies the conditional interventional distribution for each treatment--covariate state as the causal estimation target; (ii) estimates the conditional distribution such that its averaged Wasserstein risk is minimized; (iii) establishes minimax optimality. GANICE achieves these advantages through the introduction of the extended Wasserstein distance, the incorporation of a cellwise critic in its dual, and an optimality proof based on Besov space theory. Our experiments demonstrate that GANICE consistently outperforms existing methods.
☆ Unveiling High-Probability Generalization in Decentralized SGD
Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{δ\sqrt{mn}}\right)$, where $δ$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/δ)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/δ)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning-a weaker notion than uniform stability-and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
☆ Task-Aware Calibration: Provably Optimal Decoding in LLMs
LLM decoding often relies on the model's predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model's output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model's predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
☆ Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
Erasing specific concepts from text-to-image diffusion models is essential for avoiding the generation of copyrighted and explicit content. Closed-form concept erasure methods offer a fast alternative to backpropagation-based techniques, but they become less effective when scaling from smaller models such as Stable Diffusion 1.5 to larger models like Stable Diffusion XL. To maintain erasure effectiveness in these larger-scale architectures, we propose SParse cross-Attention-based Concept Erasure (SPACE). SPACE iteratively modifies the cross-attention parameters of a model with a closed-form update that jointly induces sparsity and erases target concepts. By concentrating the concept mapping to a lower-dimensional subspace, SPACE achieves superior erasure efficacy compared to dense baselines. Extensive experimental results show improvements in erasure effectiveness and robustness against adversarial prompts. Furthermore, SPACE achieves 80\%-90\% cross-attention sparsity, reducing the storage requirements for saving the modified parameters by 70\%, demonstrating its memory efficiency.
☆ Many Needles in a Haystack: Active Hit Discovery for Perturbation Experiments ICML
High-throughput gene perturbation experiments can test several genetic interventions in parallel, yet experimental budgets remain limited. A central goal is hit discovery: identifying as many perturbations as possible whose phenotypic effect exceeds a predefined threshold. Pure exploration strategies are statistically inefficient, wasting budget on low-value regions. Bayesian optimization methods offer a principled alternative but target a single global optimum, over-exploiting dominant modes while neglecting other high-value regions. We formalize hit discovery as a sequential experimental design problem and propose Probability-of-Hit, an acquisition function that directly targets threshold exceedance by ranking candidates according to their posterior probability of being a hit. We prove asymptotic optimality of this approach and demonstrate strong empirical performance on both synthetic benchmarks and real biological immunology datasets, including up to 6.4% improvement over baselines on the Schmidt IL-2 dataset.
comment: To be published in International Conference on Machine Learning (ICML) 2026
☆ Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration OSDI 2026
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.
comment: OSDI 2026
☆ TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
comment: work in progress
☆ ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
Designing proteins with desired functions or properties represents a core goal in synthetic biology and drug discovery. Recent advances in protein language models (PLMs) have enabled the generation of highly designable protein sequences, while preference alignment provides a promising way to steer designs toward desired functions and properties. Nevertheless, they often trigger catastrophic forgetting of pretrained knowledge, degrading basic designability and failing to balance multiple competing objectives. To address these issues, we draw inspiration from On-Policy Distillation (OPD), an advanced post-training method renowned for mitigating catastrophic forgetting through its mode-seeking nature. In this work, we propose ProteinOPD, a multi-objective preference alignment framework that can effectively balance multiple preference objectives while maintaining the inherent designability of PLMs. ProteinOPD adapts a pretrained PLM into preference-specific teachers and distills their knowledge into a shared student via token-level OPD on the student's own trajectories. During this process, the student is aligned to a unique normalized geometric consensus of weighted teachers while ensuring bounded optimization under conflicts. This bridges the gap for OPD in multi-objective/teacher alignment. Extensive experiments show that ProteinOPD achieves substantial gains on target preference objectives without compromising the designability, with an 8x training speedup over RL-based alignment competitors.
☆ Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization ICML2026
Sharpness-Aware Minimization (SAM) improves generalization by minimizing the worst-case loss within a fixed parameter-space radius neighborhood. SAM and its variants mainly rely on a first-order linearized surrogate, while flat minima are inherently a second-order (curvature) notion.We revisit this mismatch and propose Loss-Equated SAM (LE-SAM), which inverts the traditional SAM mechanism that fixed perturbation radius with a fixed loss-space budget,effectively removing gradient-norm-dominated learning signals and shifting optimization toward curvature-dominated terms. Extensive experiments across diverse benchmarks and tasks demonstrate the strong generalization ability of LESAM that consistently outperforms SAM and even its variants, achieving the state-of-the-art performance.
comment: Accepted by ICML2026
☆ One-Step Graph-Structured Neural Flows for Irregular Multivariate Time Series Classification
Neural Flows efficiently model irregular multivariate time series by directly learning ODE solution trajectories with neural networks, bypassing step-by-step numerical solvers. Despite their efficiency, many existing approaches treat variables independently, leaving inter-variable interactions underexplored. Moreover, their one-step mapping makes interaction modeling inherently challenging, as it removes the iterative refinement of interactions during learning. To address this challenge, we propose one-step Graph-Structured Neural Flows (GSNF), which introduce two auxiliary-trajectory self-supervision strategies to strengthen interaction learning: (i) interaction-aware trajectory generation via re-initialization, which induces trajectory divergence to expose graph-induced interactions, with a theoretically derived lower bound on divergence; and (ii) reverse-time trajectory generation, which enforces forward-backward consistency to regularize graph learning, enabled by flow invertibility. Experiments on five real-world datasets show that GSNF achieves state-of-the-art classification performance with highly competitive training time and memory usage.
☆ Joint sparse coding and temporal dynamics support context reconfiguration
Adaptive behavior requires the brain to transition between distinct contexts while maintaining representations of prior experience. The ability to reconfigure neural representations without erasing previously acquired knowledge is central to learning in dynamic environments, yet the neural mechanisms that support this balance remain unclear. Understanding these mechanisms is also critical for addressing catastrophic forgetting in artificial systems designed for lifelong learning. Here, we identify joint sparse coding and temporal dynamics in both the mouse medial prefrontal cortex (mPFC) and computational networks as mechanisms that help preserve prior representations during context transitions. Specifically, sparsity in context-dependent representations reduces cross-context interference, whereas temporal dynamics within the network activity further enhance context separability across time. Strikingly, networks endowed with both properties, such as spiking neural networks, exhibit improved retention during lifelong learning without auxiliary heuristics. These findings establish joint sparse coding and temporal dynamics as a core mechanism supporting flexible context reconfiguration in lifelong learning and, through their activity constraining nature, as an energy-efficient architectural principle for stable adaptation. Together, they provide a mechanistic framework for understanding how the brain preserves prior knowledge while flexibly adapting to new contexts.
comment: 37 pages, 6 figures, 6 extended data figures. Preprint version
☆ Balancing Efficiency and Fairness in Traffic Light Control through Deep Reinforcement Learning
Urban traffic congestion presents a significant challenge for modern cities, which impacts mobility and sustainability. Traditional traffic light control systems often fail to adapt to dynamic conditions, leading to inefficiencies. This paper proposes a novel deep reinforcement learning agent for traffic light control that addresses this limitation by explicitly integrating fairness considerations for both vehicular and pedestrian traffic. Unlike prior work, our approach dynamically balances these flows based on real-time demand, moving beyond systems focused solely on vehicles. Experimental results demonstrate that our agent effectively reduces congestion while ensuring equitable service for both the categories of road users. This research contributes to a practical and adaptable solution for intelligent traffic management within the framework of smart cities, paving the way for more efficient and inclusive urban mobility.
comment: Paper accepted to the 2026 IFAC World Congress, held in Busan (KOR), August 23rd-28th, 2026
☆ Hyperparameter Transfer for Dense Associative Memories
Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.
☆ Coarsening Linear Non-Gaussian Causal Models with Cycles
Recent work on causal abstraction, in particular graphical approaches focusing on causal structure between clusters of variables, aims to summarize a high-dimensional causal structure in terms of a low-dimensional one. Existing methods for learning such summaries from data assume that both the high- and low-dimensional structures are acyclic, which is helpful for causal effect identification and reasoning but excludes many high-dimensional models and thus limits applicability. We show that in the linear non-Gaussian (LiNG) setting, the high-dimensional acyclicity assumption can be relaxed while still allowing recovery of a low-dimensional causal directed acyclic graph (DAG). We further connect identifiability of this low-dimensional DAG to existing results: LiNG models with cycles are observationally identifiable only up to an equivalence class whose members differ by reversals of directed cycles; our low-dimensional DAG, which is invariant across all members of a given equivalence class, thus forms a natural representative of the class. While existing approaches for learning this observational equivalence class over high-dimensional variables have exponential time complexity, our low-dimensional summary is learned in worst-case cubic time and comes with explicit bounds on the sample complexity. We provide open source code and experiments on synthetic data to corroborate our theoretical results.
☆ OUIDecay: Adaptive Layer-wise Weight Decay for CNNs Using Online Activation Patterns
Weight decay remains one of the most widely used regularization mechanisms for training convolutional neural networks, yet it is still commonly applied as a fixed coefficient shared by all layers throughout training. This uniform treatment ignores that different layers may follow different structural dynamics and therefore may require different regularization strengths. In this work, we propose OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs driven by the Overfitting-Underfitting Indicator (OUI), an activation-based metric previously shown to provide early information about regularization quality. OUIDecay uses a lightweight batch-based formulation of OUI to monitor the structural behavior of each layer online and periodically rescales its weight decay relative to the other layers in the network. Unlike gradient-based adaptive decay methods, our approach relies on functional information extracted from activation patterns and does not require validation data. Experiments on EfficientNet-B0 with Stanford Cars, ResNet50 with Food101, DenseNet121 with CIFAR100, and MobileNetV2 with CIFAR10 show that OUIDecay achieves the best mean best-validation-loss in 7 out of 8 evaluated settings. These results indicate that activation-driven weight decay adaptation is a practical and effective alternative to fixed decay and gradient-based adaptive decay, while keeping the method lightweight and suitable for online use.
☆ jNO: A JAX Library for Neural Operator and Foundation Model Training
jNO (jax Neural Operators) is a JAX-native library for neural operators and foundation models with unified support for both data-driven and physics-informed training. Its core design is a tracing system in which domains, model calls, residuals, supervised losses, and diagnostics are written in one symbolic language and compiled into one optimization pipeline. This allows users to move between operator regression, mesh-aware residual evaluation, and PDE-constrained training without restructuring the surrounding code. jNO also supports multi-model compositions, fine-grained control at parameter level (model, optimizer, and learning rate), hyperparameter tuning, and JAX-native workflows for translated PDE foundation-model families. The source repository is available at https://github.com/FhG-IISB/jNO.
☆ Unsupervised Process Reward Models
Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.
comment: preprint
☆ Stable Long-Horizon PDE Forecasting via Latent Structured Spectral Propagators
Long-horizon forecasting of time-dependent partial differential equations (PDEs) is critical for characterizing the sustained evolution of physical systems. While neural operators have emerged as efficient surrogates, they typically learn implicit finite-time transitions from discrete observations. When deployed autoregressively, such propagators often suffer from rapid error accumulation and dynamic drift. To address this, we propose a neural forecasting framework that reformulates PDE rollout as learning a Structured Spectral Propagator (SSP) in a propagation-oriented latent space. Following an analysis-propagation-synthesis design, our framework: (i) maps physical states into a shared, time-consistent spatial representation; (ii) projects this space into a compact propagation state to isolate recurrent dynamics from fine-grained spatial details, thereby decoupling reconstruction fidelity from rollout regularity; and (iii) evolves retained spectral modes using a frequency-conditioned linear backbone complemented by a nonlinear spectral closure to account for truncated interactions. This explicit structuring endows the propagator with a strong inductive bias for coherent modal evolution. Extensive experiments demonstrate that SSP significantly outperforms state-of-the-art baselines, reducing relative $L_2$ errors by up to 48.9% and exhibiting improved stability in temporal extrapolation beyond the supervised horizon.
☆ APEX: Audio Prototype EXplanations for Classification Tasks
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.
☆ Learning to Sparsify Stochastic Linear Bandits IJCAI 2026
This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in minimizing cumulative regret while tackling the potential NP-hardness of finding optimal sparse actions due to the inherent combinatorial structure of the problem. We propose an adaptively phased exploration and exploitation algorithmic framework, utilizing ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, optimal sparse actions can be efficiently computed, enabling us to establish a $\tilde{\mathcal{O}}(d\sqrt{T})$ regret, where $d$ is the dimension of the action vector and $T$ is the time horizon length. For general convex and compact action sets where finding optimal sparse actions is intractable, we employ a greedy subroutine. For general strongly convex action sets, we derive a $\tilde{\mathcal{O}}(d \sqrt{T})$ $α$-regret; for general compact sets lacking strong convexity, we establish a $\tilde{\mathcal{O}}(d T^{2/3})$ $α$-regret, where $α$ pertains to the approximation ratio of the greedy algorithm. Finally, we validate the performance of our algorithms using extensive experiments including an application to recommendation system.
comment: Include all the omitted details and proofs from the conference paper accepted to IJCAI 2026
☆ PFN-TS: Thompson Sampling for Contextual Bandits via Prior-Data Fitted Networks
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence used in previous predictive-sequence approaches, and reuses TabICL's cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at https://anonymous.4open.science/r/PFN_TS-36ED/.
☆ Per-Loss Adapters for Gradient Conflict in Physics-Informed Neural Networks
Physics-informed neural networks (PINNs) train a single neural approximation by minimizing multiple physics- and data-derived losses, but the gradients of these losses often interfere and can stall optimization. Existing remedies typically treat this pathology either through scalar loss balancing or full-parameter-space gradient surgery, leaving it unclear which intervention is most appropriate. We show that PINN gradient conflict is not a uniform failure mode with one universal remedy. Instead, we identify distinct PINN gradient-conflict regimes, each associated with a different intervention class. Persistent directional conflict may require separate loss-indexed parameter subspaces, magnitude imbalance often favors scalar reweighting, and low or transient conflict may require no extra mitigation. To select between scalar reweighting and a lightweight architectural intervention, we propose a diagnostic-first framework. It profiles a 1000-step unmodified PINN run and, when intervention is warranted, uses one low-rank adapter per loss to create explicit loss-indexed parameter subspaces attached to a shared PINN trunk, providing each loss with a direct gradient pathway. Across more than 60 PDE configurations, including forward, inverse, multi-physics, parameter-varying, and high-dimensional problems up to 50D, persistent directional conflict dominates standard forward $K=3$ benchmarks and a natural $K=4$ thermoelastic system, where adapters combined with reweighting yield significant improvements. In contrast, $K=3$ inverse problems and natural $K=5$ and $K=6$ multi-physics systems are largely magnitude-dominated and often favor reweighting alone, while full-parameter-space gradient surgery can fail on heterogeneous parameter spaces.
comment: 49 pages, 10 figures
☆ GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference IEEE
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. As a promising architecture, Speculative Decoding (SD) is increasingly adopted where a lightweight draft model rapidly generates candidate tokens to be verified by a powerful target model. However, a fundamental challenge lies in achieving per-token resource scheduling to effectively adapt SD paradigm to resource-constrained edge environment. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing long-term energy-throughput trade-off. Further, a nested entropy-driven generation mechanism executes early exiting to adapt to per-token dynamic generative uncertainty. Theoretical analysis establishes a rigorous performance bound on long-term throughput for GELATO. Extensive evaluations demonstrate that GELATO achieves a globally optimal tradeoff, outperforming state-of-the-art distributed SD architectures by 64.98% in token throughput and reducing energy consumption by 47.47% under resource-constrained environments, while preserving LLM decoding quality.
comment: This work has been submitted to the IEEE for possible publication
☆ Complex-Valued Phase-Coherent Transformer
Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT replaces token competition with token-non-competing attention and is designed to preserve phase information across layers. Across mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT shows strong generalisation across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Moreover, even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued NN baseline in our comparison. Experiments introducing gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse on long-range retrieval, and gates whose outputs become excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support introducing multi-layer phase-coherent structure into attention as a promising design principle for achieving generalisation in complex-valued Transformers.
comment: 26 pages, 17 tables (no figures). Companion Lean 4 formalization of Theorems 1 and 2 at https://github.com/leohio/phase-coherent-transformer-r-d
☆ Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver
Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.
☆ Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces
Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-stablished neurophysiological descriptions of the P300. Experimental results show a 9\% improvement in performance over state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.
☆ Scaling the Memory of Balanced Adam
Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, the value of this parameter is still poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different beta values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice $β=0.94377$, the refresh rule improves worst-case robustness, reducing the global maximum validation gap by $33.4\%$, while bringing all 11 runs within $1\%$ of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is better understood as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
☆ Generating Symmetric Materials using Latent Flow Matching
Tackling the task of materials generation, we aim to enhance the previously proposed All-atom Diffusion Transformer (ADiT) by introducing SymADiT, a symmetry-aware variant. To do so, we use a representation of materials based on Wyckoff positions. We follow ADiT and perform generative modelling in latent space, adapted to our symmetry-aware representation. By forcing the output of the generative model to adhere to the symmetry restrictions imposed by the generated crystal's space group and each atom's Wyckoff-position, the generated materials exhibit more realistic symmetry properties. We benchmark our method against both symmetry-aware and symmetry-agnostic models for materials generation and show competitive performance, generating stable, symmetric materials with a simple Transformer architecture.
comment: Preprint
☆ CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients
Motor imagery electroencephalography (MI-EEG) decoding offers a non-invasive route for post-stroke rehabilitation, but cross-patient use remains difficult because pathological neural reorganization changes task-related EEG dynamics, aperiodic activity, local excitability, cross-regional coordination, and trial-level brain-state context. This makes source-learned MI representations unreliable for unseen patients. To address this problem, we propose CFSPMNet, a cross-patient adaptation framework that models post-stroke MI-EEG as latent neural-state organization. CFSPMNet combines a Fourier-Reorganized State Mamba Network (FRSM) with Shared-Private Prototype Matching (SPPM). FRSM represents each trial as a latent physiological token sequence, reorganizes token states in the Fourier domain, and uses Fourier-derived trial context to guide Mamba state-space propagation. SPPM improves target pseudo-label updating by combining semantic confidence with shared-private physiological consistency, filtering confident but physiologically inconsistent target predictions. Leave-one-subject-out experiments on two stroke MI-EEG datasets show that CFSPMNet outperforms representative CNN-, Transformer-, Mamba-, and adaptation-based baselines, achieving average accuracies of 68.23% on XW-Stroke and 73.33% on 2019-Stroke, with gains of 5.63 and 8.25 percentage points over the strongest competitors. Ablation, sensitivity, feature-alignment, pseudo-label selection, and neurophysiological visualization analyses further support the roles of Fourier-domain token-state reorganization and calibrated pseudo-label updating. These results suggest that latent neural-state modeling can improve rehabilitation-oriented cross-patient BCI decoding. Code is available at https://github.com/wxk1224/CFSPMNet.
☆ GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
Joint named entity recognition (NER) and relation extraction (RE) is a fundamental task in natural language processing for constructing knowledge graphs from unstructured text. While recent approaches treat NER and RE as separate tasks requiring distinct models, we introduce GLiNER-Relex, a unified architecture that extends the GLiNER framework to perform both entity recognition and relation extraction in a single model. Our approach leverages a shared bidirectional transformer encoder to jointly represent text, entity type labels, and relation type labels, enabling zero-shot extraction of arbitrary entity and relation types specified at inference time. GLiNER-Relex constructs entity pair representations from recognized spans and scores them against relation type embeddings using a dedicated relation scoring module. We evaluate our model on four standard relation extraction benchmarks: CoNLL04, DocRED, FewRel, and CrossRE, and demonstrate competitive performance against both specialized relation extraction models and large language models, while maintaining the computational efficiency characteristic of the GLiNER family. The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both entities and relation triplets in a single call. All models and code are publicly available.
comment: 19 pages, 1 figure, 2 tables
♻ ☆ LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training
Training deep neural networks with noise and data heterogeneity is a major challenge. We introduce Lightweight Learnable Adaptive Weighting (LiLAW), a method that dynamically adjusts the loss weight of each training sample based on its evolving difficulty, categorized as easy, moderate, and hard, using only three global learnable scalar parameters. LiLAW learns to adaptively prioritize samples by updating these parameters with a single gradient descent step on a validation mini-batch after each training mini-batch, without requiring a clean, unbiased validation set. Experiments across general and medical imaging datasets, several noise types and levels, loss functions, and architectures with and without pretraining, including linear probing and full fine-tuning, show that LiLAW consistently improves accuracy and AUROC, especially in higher-noise settings, without requiring excessive tuning. We also obtain state-of-the-art results incorporating synthetic and augmented data from SynPAIN, GAITGen, ECG5000, and improved fairness on the Adult dataset. LiLAW is lightweight, practical, and computationally efficient, making it an effective, scalable approach to boost generalization and robustness across diverse deep learning training setups, especially in resource-constrained settings.
♻ ☆ FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation, which (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove the convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. Real-world cross-facility deployment of FedQueue shows 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines; in particular, about 34% reduction in time to reach a target accuracy level under high queue variance and non-IID partitions.
♻ ☆ Distributionally Robust Token Optimization in RLHF
Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO enhances consistency under distribution shifts in multiple reasoning benchmarks among different tasks, achieving $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
♻ ☆ Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.
comment: 20 pages
♻ ☆ When Less is More: The LLM Scaling Paradox in Context Compression
Scaling up model parameters has long been a prevalent training paradigm driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor--decoder setup, we find a \textbf{\textit{Size-Fidelity Paradox}}: increasing compressor size can lessen the faithfulness of reconstructed contexts though reconstruction error decreases. Across 27 compressor setups spanning model families, scales, and compression rates, we coin this paradox arising from two dominant factors: 1) \textit{knowledge overwriting}: larger models increasingly replace source facts with their own prior beliefs, \textit{e.g.}, ``the white strawberry`` $\to$ ``the red strawberry``; and 2) \textit{semantic drift}: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, \textit{e.g.}, ``Alice hit Bob`` $\to$ ``Bob hit Alice``. Interestingly, this paradox persists across varied settings, with mid-sized compressors often outperforming larger ones in faithful recovery. By analyzing the compressed memory via embedding geometry and reconstruction determinacy, we further reveal that compressors tend to organize memory across broader semantic subspaces, yielding more ambiguous representations prone to overwriting, drift, and weakened recovery. These findings complement existing evaluations of context compression and expose a breakdown of scaling laws when the objective shifts from plausible generation to faithful preservation.
comment: 22 pages, 7 figures, conference
♻ ☆ Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 3 popular agent harnesses and 5 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 45.1%.
comment: 29 pages, 16 figures
♻ ☆ Reinforcement Learning with Action Chunking NeurIPS 2025
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); 29 pages, 17 figures
♻ ☆ Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
comment: Work in progress (48 pages)
♻ ☆ Feature Augmentation of GNNs for ILPs: Local Uniqueness Suffices
Integer Linear Programs (ILPs) are central to real-world optimizations but notoriously difficult to solve. Learning to Optimize (L2O) has emerged as a promising paradigm, with Graph Neural Networks (GNNs) serving as the standard backbone. However, standard anonymous GNNs are limited in expressiveness for ILPs, and the common enhancement of augmenting nodes with globally unique identifiers (UIDs) typically introduces spurious correlations that severely harm generalization. To address this tradeoff, we propose a parsimonious Local-UID scheme based on d-hop uniqueness coloring, which ensures identifiers are unique only within each node's d-hop neighborhood. Building on this scheme, we introduce ColorGNN, which incorporates color information via color-conditioned embeddings, and ColorUID, a lightweight feature-level variant. We prove that for d-layer networks, Local-UIDs achieve the expressive power of Global-UIDs while offering stronger generalization. Extensive experiments show that our approach yields substantial and robust gains across ILP benchmarks.
comment: 19 pages, 9 Tables
♻ ☆ Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions ICML 2026
Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near Bayes-optimal performance. We validate the theory with extensive simulations, and further demonstrate gains on pretrained LLMs (GPT-2 and Llama2-7B) on question-answering benchmarks under distribution shift induced by noisy in-context demonstrations. Overall, attention temperature emerges as a principled, lightweight knob for improving the robustness of ICL in pretrained Transformers.
comment: ICML 2026, 24 pages, 7 figures
♻ ☆ PINS: Proximal Iterations with Sparse Newton and Sinkhorn for Optimal Transport
Optimal transport (OT) is a widely used tool in machine learning, but computing high-accuracy solutions for large instances remains costly. Entropic regularization and the Sinkhorn algorithm improve scalability; however, when the regularization parameter is small, Sinkhorn convergence slows, and the iterates approach an entropic solution that remains separated from the true OT plan by an entropic-bias plateau. We introduce PINS (Proximal Iterations with sparse Newton and Sinkhorn), a two-loop solver designed to move beyond this plateau. The outer loop applies an entropic proximal-point method, solving the original OT problem through a sequence of entropic subproblems with shifted cost matrices. Each inner subproblem is then solved by a Sinkhorn warm-up followed by sparse-Newton refinement. We prove that PINS converges globally to an optimal solution of the unregularized OT problem and that the inner Hessian admits a sparsification at every outer iteration with a structure independent of the cost matrix. On synthetic and augmented-MNIST instances, PINS achieves much lower relative cost errors than Sinkhorn-type baselines, which stall at the entropic-bias plateau, and is $5$--$73\times$ faster than Sinkhorn with the same outer loop at matched accuracy. On large-scale DOTmark instances, a streaming implementation reduces peak memory by $24$--$54\%$ compared with the network-simplex linear programming (LP) solver and remains feasible under per-process memory budgets for which the LP solver fails.
♻ ☆ A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?
Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.
comment: This paper has been accepted for publication in the Journal of Systems and Software (JSS)
♻ ☆ Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Furthermore, integrating LED into reinforcement learning, e.g., using GRPO as the rollout strategy, yields faster reward improvement and higher final performance, due to the efficient exploration capability of LED. Project page: https://github.com/AlbertTan404/LED.
comment: Project Page: https://github.com/AlbertTan404/LED
♻ ☆ A new initialisation to Control Gradients in Sinusoidal Neural network
Proper initialisation strategy is of primary importance to mitigate gradient explosion or vanishing when training neural networks. Yet, the impact of initialisation parameters still lacks a precise theoretical understanding for several well-established architectures. Here, we propose a new initialisation for networks with sinusoidal activation functions such as \texttt{SIREN}, focusing on gradients control, their scaling with network depth, their impact on training and on generalization. To achieve this, we identify a closed-form expression for the initialisation of the parameters, differing from the original \texttt{SIREN} scheme. This expression is derived from fixed points obtained through the convergence of pre-activation distribution and the variance of Jacobian sequences. Controlling both gradients and targeting vanishing pre-activation helps preventing the emergence of inappropriate frequencies during estimation, thereby improving generalization. We further show that this initialisation strongly influences training dynamics through the Neural Tangent Kernel framework (NTK). Finally, we benchmark \texttt{SIREN} with the proposed initialisation against the original scheme and other baselines on function fitting and image reconstruction. The new initialisation consistently outperforms state-of-the-art methods across a wide range of reconstruction tasks, including those involving physics-informed neural networks.
♻ ☆ AI Alignment via Incentives and Correction
We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.
♻ ☆ A Unified Representation of Neural Networks Architectures
In this paper we consider the limiting case of neural networks (NNs) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. Firstly, we consider the case of neural networks with a single hidden layer and we derive an integral infinite width neural representation that generalizes existing continuous neural networks (CNNs) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Secondly, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NNs architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Relations with neural fields and other neural integro-differential equations are discussed along with further possible generalizations and applications of the DiPaNet framework.
comment: Typographical corrections and additional clarifications, remarks; few new relevant references added and acknowledgements; main results unchanged
♻ ☆ Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution
Automated equity trading requires converting noisy market and news signals into executable portfolio decisions under risk, turnover, and transaction costs. We propose Hierarchical Reinforced Trader (HRT), a bi-level reinforcement learning framework for text-aware portfolio management in multi-asset equity markets. HRT separates trading into two coordinated decisions: a factorized sparse High-Level Controller (HLC) selects asset-level increase, reduce, or hold directions from compact market and text-derived signals, while a risk-aware Low-Level Controller (LLC) converts these directions into feasible portfolio weight adjustments under turnover, drawdown, and text-risk penalties. This decomposition avoids enumerating the full joint action space and makes selection and execution easier to inspect. We evaluate HRT on an open stock-news benchmark with a fixed 89-stock Nasdaq universe, using 2013--2018 for training, 2019 for validation, and 2020--2023 for final out-of-sample testing; the test horizon is restricted to 2020--2023 due to public benchmark data availability under the same timestamp-clean text-aware protocol. Across market-proxy, same-universe portfolio, alpha-only, flat-RL, and hierarchical ablation baselines, HRT delivers the strongest learning-based return--risk--cost trade-off. The full model improves Sharpe from 1.06 for HRT-Base to 1.24, reduces daily turnover from 0.112 to 0.090, and remains robust under transaction-cost stress. These results suggest that separating sparse directional selection from risk-aware execution is an effective way to incorporate market forecasts and text-derived risk signals into portfolio management.
♻ ☆ Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a sample average lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on scientific foundation models for weather and time-series forecasting, as well as several regression tasks. Across benchmarks against uncertainty-aware baselines, we find that Sample Average Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable calibration, with adaptation costs nearly three orders of magnitude lower than the next-best baseline.
♻ ☆ Beyond Multiple Choice: Evaluating Steering Vectors for Summarization EACL 2026
Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.
comment: Published in Findings of EACL 2026. Extended version of the ICML 2025 Workshop on Reliable and Responsible Foundation Models paper (v1, v2). 36 pages, 21 figures, 15 tables
♻ ☆ VeRO: An Evaluation Harness for Agents to Optimize Agents ICML
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
comment: Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026
♻ ☆ The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
comment: 35 pages, 15 figures v2: minor layout fixes and author list update
♻ ☆ Equation-Free Digital Twins for Nonlinear Structural Dynamics
Monitoring high-dimensional engineering structures in extreme environments is limited by non-stationary excitation, nonlinear structural kinematics, and stochastic forcing. Traditional model-based and black-box data-driven methods often struggle to resolve these dynamics in real time, particularly under sensor failure or partial observability. This paper introduces a rank-optimized digital twin framework based on Koopman operator theory, Hankel-matrix embeddings, and dynamic mode decomposition. By lifting operational data into a linear invariant subspace, the method enables autonomous, input-blind reconstruction of structural states without requiring a priori mass or stiffness matrices. The framework is validated on an NREL 5MW spar-buoy floating offshore wind turbine, representing a challenging coupled aero-hydro-servo-elastic system. Results show that the rank-optimized Koopman-Hankel manifold separates structural resonances from deterministic 3P rotor harmonics under colored noise, where standard subspace identification can be unreliable. A rolling-horizon virtual sensing strategy achieves high-fidelity reconstruction at critical structural hotspots, with coefficient of determination greater than 0.95 at 1 Hz data assimilation and accuracy exceeding 0.99 at higher sampling rates. By estimating a physical Lyapunov time of approximately 1.0 s, the study defines the predictability horizon associated with the system information barrier. The proposed framework provides a computationally efficient and resilient digital twin approach for real-time identification and virtual sensing of complex structural dynamics.
comment: Added code availability statement linking the GitHub repository and archived Zenodo software release
♻ ☆ Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction
Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
comment: Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request
♻ ☆ Constructive conditional normalizing flows
Motivated by applications in conditional sampling, given a probability measure $μ$ and a diffeomorphism $φ$, we consider the problem of simultaneously approximating $φ$ and the pushforward $φ_{\#}μ$ by means of the flow of a continuity equation whose velocity field is a perceptron neural network with piecewise constant weights. We provide an explicit construction based on a polar-like decomposition of the Lagrange interpolant of $φ$. The latter involves a compressible component, given by the gradient of a particular convex function, which can be realized exactly, and an incompressible component, which -- after approximating via permutations -- can be implemented through shear flows intrinsic to the continuity equation. For more regular maps $φ$ -- such as the Knöthe-Rosenblatt rearrangement -- we provide an alternative, probabilistic construction inspired by the Maurey empirical method, in which the number of discontinuities in the weights doesn't scale inversely with the ambient dimension.
♻ ☆ High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
comment: 19 Pages,11 figures,8 tables
♻ ☆ Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics
This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from a state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions with respect to the current literature on digital twins. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
♻ ☆ Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE IEEE
Approval of ADS depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches, they lack interpretability and may not align with domain-knowledge. This work contributes to a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain-knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.
comment: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
♻ ☆ CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model ICLR 2026
Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capturing global dependencies and neglecting important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across eight downstream tasks and ten datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyzes, and interpretability evaluations. The code and the pretrained weights are available at https://github.com/jingyingma01/CodeBrain.
comment: Published as a conference paper at the International Conference on Learning Representations (ICLR 2026)
♻ ☆ Unifying Perspectives: Plausible Counterfactual Explanations on Global, Group-wise, and Local Levels
The growing complexity of AI systems has intensified the need for transparency through Explainable AI (XAI). Counterfactual explanations (CFs) offer actionable "what-if" scenarios on three levels: Local CFs providing instance-specific insights, Global CFs addressing broader trends, and Group-wise CFs (GWCFs) striking a balance and revealing patterns within cohesive groups. Despite the availability of methods for each granularity level, the field lacks a unified method that integrates these complementary approaches. We address this limitation by proposing a gradient-based optimization method for differentiable models that generates Local, Global, and Group-wise Counterfactual Explanations in a unified manner. We especially enhance GWCF generation by combining instance grouping and counterfactual generation into a single efficient process, replacing traditional two-step methods. Moreover, to ensure trustworthiness, we innovatively introduce the integration of plausibility criteria into the GWCF domain, making explanations both valid and realistic. Our results demonstrate the method's effectiveness in balancing validity, proximity, and plausibility while optimizing group granularity, with practical utility validated through practical use cases.
♻ ☆ Statistical Taylor Expansion: A New and Path-Independent Method for Uncertainty Analysis
As a rigorous statistical approach, statistical Taylor expansion extends the conventional Taylor expansion by replacing precise input variables with random variables of known distributions and sample counts to compute the mean, the deviation, and the reliable factor of each result. It tracks the propagation of the input uncertainties through intermediate steps, so that the final analytic result becomes path independent. Therefore, it differs fundamentally from common approaches in applied mathematics that optimize computational path for each calculation. Statistical Taylor expansion may standardize numerical computations for analytic expressions. This study also introduces the implementation of statistical Taylor expansion termed variance arithmetic and presents corresponding test results across a wide range of mathematical applications. Another important conclusion of this study is that numerical errors in library functions can significantly affect results. It is desirable that each value from library functions be accomplished by an uncertainty deviation. The possible link between statistical Taylor expansion and quantum physics is discussed as well.
comment: 46 pages, 40 figures
♻ ☆ MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasonning Models
Recent reasoning-based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein-pocket targets, resulting in 50k multi-objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption-based generation or molecule-editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general-purpose and chemistry-specialized open-source LLMs and introduce a diversity-aware top-k metric to measure whether models can generate a diverse set of high-scoring molecules. Finally, we use the verifier to fine-tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity-exploitation trade-off. MolRGen provides a scalable testbed for studying verifier-based reasoning and reinforcement learning in molecular design.
♻ ☆ SCOPE: Structured Prototype-Guided Adaptation for EEG Foundation Models with Limited Labels
Electroencephalography (EEG) foundation models (EFMs) have shown strong potential for transferable representation learning, yet their adaptation in realistic settings remains challenging when only a few labeled subjects are available. We show that this challenge stems from a structural mismatch between noisy, limited supervision and the highly plastic parameter space of EFMs, reflected in three key failure modes: overconfident miscalibration, prediction collapse, and representation drift caused by unconstrained parameter updates. To address these challenges, we propose SCOPE, a Structured COnfidence-aware Prototype-guided framework for label-limited EFM adaptation. SCOPE first constructs cohort-level external supervision to provide persistent guidance and further derives confidence-aware pseudo-labels to select reliable unlabeled samples for adaptation. Building on the constructed external supervision, SCOPE introduces ProAdapter, a lightweight prototype-conditioned adapter that modulates frozen EFMs to preserve pretrained representations. Experiments across 50 label-limited adaptation settings, covering 6 EEG tasks, 5 EFM backbones, and 5%-50% training labeled-subject ratios, show that SCOPE consistently achieves strong performance and efficiency.
♻ ☆ Selective Neuron Amplification in Transformer Language Models
Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification, which increases the influence of task relevant neurons without changing the model's parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
comment: 11 pages, 3 figures. Preprint. Code and experiments conducted independently
♻ ☆ What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering ICLR 2026
Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation to a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
comment: 41 pages, 34 figures, Accepted at ICLR 2026, Code available at https://github.com/Jim-Maar/implicit-planning-in-llms
♻ ☆ Beyond Hard Writes and Rigid Preservation: Soft Recursive Least-Squares for Lifelong LLM Editing
Model editing updates a pre-trained LLM with new facts or rules without retraining while preserving unrelated behavior. In real deployment, edits arrive as long streams, creating a plasticity-stability dilemma: repeated locate-then-edit "hard writes" can accumulate interference over time, while rigid preservation constraints may protect only explicitly constrained directions, allowing past edits or unconstrained behaviors to deviate. We propose RLSEdit, a recursive least-squares editor for long sequential editing. RLSEdit formulates editing as an online quadratic optimization with soft constraints, minimizing a cumulative key-value fitting objective together with two regularizers that control deviation from the pre-trained weights and from a designated anchor mapping. This objective admits an efficient Woodbury-based online recursion, with per-edit cost independent of history length and scaling only with the current edit size. We further provide deviation bounds and an asymptotic characterization of the adherence-preservation trade-off in the many-edits regime. Experiments on CounterFact and ZsRE across multiple model families show stable scaling to 10K edits, outperforming strong baselines in both edit success and holistic stability, while retaining early edits and preserving general capabilities on GLUE and held-out reasoning/code benchmarks.
♻ ☆ Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $σ^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $σ^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
♻ ☆ Beyond Accuracy: Evaluating Posterior Fidelity of Diffusion Inverse Solvers
Uncertainty evaluation is critical in scientific and engineering inverse problems. However, existing benchmarks on Diffusion Inverse Solvers (DIS) primarily focus on reconstruction accuracy but overlook uncertainty and distributional behavior. Since stochastic inverse solvers represent uncertainty through diffusion-based posterior samples, evaluating how well their generated samples capture the target posterior distribution becomes an important aspect of uncertainty quantification. To address this limitation and better understand the distributional behavior of diffusion samplers, we conduct a systematic study to investigate the posterior fidelity of a broad range of existing DIS methods in controlled simulation settings with a known analytical true posterior. Furthermore, to enable posterior-aware evaluation on real-world inverse problems where ground-truth posterior is unavailable, we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically-grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and learned diffusion prior. Through both simulation experiments and real-world inverse problem solving, we validate the effectiveness of the proposed score-KSD and demonstrate that it provides meaningful posterior fidelity diagnostics beyond reconstruction accuracy, revealing that higher reconstruction accuracy does not necessarily imply better posterior consistency.
♻ ☆ Tighter Information-Theoretic Generalization Bounds via a Novel Class of Change of Measure Inequalities
Change of measure inequalities translate divergences between probability measures into explicit bounds on event probabilities, and play an important role in deriving probabilistic guarantees in learning theory, information theory, and statistics. We propose novel change of measure inequalities via a unified framework based on the data processing inequality, which is surprisingly elementary yet powerful enough to yield novel, tighter inequalities. We provide change of measure inequalities in terms of a broad family of information measures, including $f$-divergences (with Kullback-Leibler divergence and $χ^2$-divergence as special cases), Rényi divergence, and $α$-mutual information (with maximal leakage as a special case). We apply these results to generalization error analysis, PAC-Bayesian theory, differential privacy, and data memorization, obtaining stronger guarantees while recovering best-known results through simplified analyses.
♻ ☆ TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. Algorithmically, TiledAttention follows the established FlashAttention-style online-softmax formulation; our novelty is the cuTile/TileIR implementation strategy, schedule-level modifiability, and reproducible benchmarking/profiling workflow. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch), explicit unfused baselines (torch_sdpa_math, standard eager attention), and forced backend probes (FlashAttention2, EffecientAttention, CuDNN Attention) across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
♻ ☆ Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. However, real-world deployments often face heterogeneous hardware, fluctuating workloads, and diverse quality-latency requirements, while training separate models for each scenario is costly. Considering that MoE models already operate with sparse activation, adjusting the number of activated experts offers a natural path to serving diverse budgets with a single model. Yet, we find that activating more experts $k'$ ($> k$) at inference does not yield the expected gains. Instead, performance degrades rapidly after only a slight increase, a phenomenon we term the \textit{inference-time scaling wall}. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce \textbf{Elastic Mixture-of-Experts (EMoE)}, a novel training framework that enables MoE models to elastically vary the number of activated experts at inference. By simultaneously training experts to collaborate in diverse combinations and encouraging the router to make high-quality selections, EMoE ensures robust performance across inference budgets. Extensive experiments across four MoE architectures (7B--21B) and nine benchmarks show that EMoE significantly expands the effective scaling range to 2-3$\times$ the training-time $k$, while also achieving higher peak performance.
♻ ☆ Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing. In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
comment: Accepted in Transactions on Machine Learning Research (2026)
♻ ☆ AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
Large Audio Language Models (LALMs) are rapidly advancing, but evaluating them remains challenging due to inefficient and non-standardized toolkits that limit fair comparison and systematic assessment. Existing evaluation frameworks exhibit three critical limitations: (1) slow and inefficient processing pipeline that bottlenecks large-scale studies, (2) inadequate multi-turn dialogue support, leaving fundamental questions about cross-turn context integration and performance dynamics over extended conversations in LALMs unanswered; and (3) the absence of unified and scalable evaluation framework capable of keeping pace with the rapid growth of both LALMs and audio benchmarks. To address these issues, we introduce AU-Harness, an efficient and comprehensive evaluation framework for LALMs. Our system achieves a speedup of up to 151% over existing evaluation toolkits through optimized batch processing and parallel execution, enabling large-scale evaluations previously considered impractical. We provide standardized prompting protocols and flexible configurations for fair model comparison across diverse scenarios. AU-Harness unlocks a range of in-depth analyses difficult to conduct without a unified foundation, including multi-turn dialogue dynamics, enabling the study of true audio reasoning capabilities in existing LALMs. AU-Harness provides both practical evaluation tools and insights into model limitations, advancing systematic LALM development.
♻ ☆ Federated Concept-Based Models: Interpretable models with distributed supervision
Concept-based Models (CMs) enhance interpretability in deep learning by grounding predictions in human-understandable concepts. However, concept annotations are costly and rarely available at scale within a single data source. Federated Learning (FL) could alleviate this limitation by enabling cross-institutional training over concept annotations distributed across multiple data owners. Yet, FL lacks interpretable modeling paradigms. Integrating CMs with FL is non-trivial: although FL supports heterogeneous and non-stationary client participation, it typically assumes a fixed shared architecture, whereas CMs may require architectural adaptation as the available concept set evolves. We propose Federated Concept-based Models (F-CMs), a new methodology for deploying CMs in evolving FL settings. F-CMs aggregate concept-level information across institutions and efficiently adapt the model architecture to changes in concept supervision while preserving privacy. Empirically, F-CMs maintain accuracy and intervention effectiveness comparable to training settings with full concept supervision, while outperforming on average non-adaptive federated baselines. Notably, F-CMs enable interpretable inference on concepts unavailable to a given institution, a key novelty over existing approaches.
♻ ☆ Explaining Graph Neural Networks for Node Similarity on Graphs
Similarity search is a fundamental task for exploiting information in various applications dealing with graph data, such as citation networks or knowledge graphs. While this task has been intensively approached from heuristics to graph embeddings and graph neural networks (GNNs), providing explanations for similarity has received less attention. In this work we are concerned with explainable similarity search over graphs, by investigating how GNN-based methods for computing node similarities can be augmented with explanations. Specifically, we evaluate the performance of two prominent approaches towards explanations in GNNs, based on the concepts of mutual information (MI), and gradient-based explanations (GB). We discuss their suitability and empirically validate the properties of their explanations over different popular graph benchmarks. We find that unlike MI explanations, gradient-based explanations have three desirable properties. First, they are actionable: selecting inputs depending on them results in predictable changes in similarity scores. Second, they are consistent: the effect of selecting certain inputs overlaps very little with the effect of discarding them. Third, they can be pruned significantly to obtain sparse explanations that retain the effect on similarity scores.
comment: Accepted in Transactions of Machine Learning Research (2026)
♻ ☆ NSPOD: Accelerating Krylov solvers via DeepONet-learned POD subspaces
The convergence of Krylov-based linear iterative solvers applied to parametric partial differential equations (PDEs) is often highly sensitive to the domain, its discretization, the location/values of the applied Dirichlet/Neumann boundary conditions, body forces and material properties, among others. We have previously introduced hybridization of classical linear iterative solvers with neural operators for specific geometries, but they tend to not perform well on geometries not previously seen during training. We partially addressed this challenge by introducing the deep operator network Geo-DeepONet and hybridizing it with Krylov-based iterative linear solvers, which, despite learning effectively across arbitrary unstructured meshes without requiring retraining, led to only modest reductions in iterations compared to state-of-the-art preconditioners. In this study we introduce Neural Subspace Proper Orthogonal Decomposition (NSPOD), a multigrid-like deep operator network-based preconditioner which can dramatically reduce the number of iterations needed for convergence in Krylov-based linear iterative solvers, even when compared to state-of-the-art methods such as algebraic multigrid preconditioners. We demonstrate its efficiency via numerical experiments on a linearized version of solid mechanics PDEs applied to unstructured domains obtained from complex CAD geometries. We expect that the findings in this study lead to more efficient hybrid preconditioners that can match, or possibly even surpass, the convergence properties of the current gold standard preconditioning methods for solid mechanics PDEs.
comment: 17 pages, 9 figures, 3 tables
♻ ☆ Stochastic Schrödinger Diffusion Models for Pure-State Ensemble Generation
In quantum machine learning (QML), classical data are often encoded as quantum pure states and processed directly as quantum representations, motivating representation-level generative modeling that samples new quantum states from an underlying pure-state ensemble rather than re-preparing them from perturbed classical inputs. However, extending \emph{score-based} diffusion models with well-defined reverse-time samplers to quantum pure-state ensembles remains challenging, due to the non-Euclidean geometry of the complex projective space $\mathbb{CP}^{d-1}$ and the intractability of transition densities. We propose \emph{Stochastic Schrödinger Diffusion Models} (SSDMs), an intrinsic score-based generative framework on $\mathbb{CP}^{d-1}$ endowed with the Fubini--Study (FS) metric. SSDMs formulate a forward Riemannian diffusion with a stochastic Schrödinger equation (SSE) realization, and derive reverse-time dynamics driven by the Riemannian score $\nabla_{\mathrm{FS}} \log p_t$. To enable training without analytic transition densities, we introduce a local-time objective based on a local Euclidean Ornstein--Uhlenbeck approximation in FS normal coordinates, yielding an analytic teacher score mapped back to the manifold. Experiments show that SSDMs faithfully capture target pure-state ensemble statistics, including observable moments, overlap-kernel MMD, and entanglement measures, and that SSDM-generated quantum representations improve downstream QML generalization via representation-level data augmentation.
♻ ☆ Inductive Entity Representations from Text via Link Prediction
Knowledge Graphs (KG) are of vital importance for multiple applications on the web, including information retrieval, recommender systems, and metadata annotation. Regardless of whether they are built manually by domain experts or with automatic pipelines, KGs are often incomplete. Recent work has begun to explore the use of textual descriptions available in knowledge graphs to learn vector representations of entities in order to preform link prediction. However, the extent to which these representations learned for link prediction generalize to other tasks is unclear. This is important given the cost of learning such representations. Ideally, we would prefer representations that do not need to be trained again when transferring to a different task, while retaining reasonable performance. In this work, we propose a holistic evaluation protocol for entity representations learned via a link prediction objective. We consider the inductive link prediction and entity classification tasks, which involve entities not seen during training. We also consider an information retrieval task for entity-oriented search. We evaluate an architecture based on a pretrained language model, that exhibits strong generalization to entities not observed during training, and outperforms related state-of-the-art methods (22% MRR improvement in link prediction on average). We further provide evidence that the learned representations transfer well to other tasks without fine-tuning. In the entity classification task we obtain an average improvement of 16% in accuracy compared with baselines that also employ pre-trained models. In the information retrieval task, we obtain significant improvements of up to 8.8% in NDCG@10 for natural language queries. We thus show that the learned representations are not limited KG-specific tasks, and have greater generalization properties than evaluated in previous work.
comment: The Web Conference 2021
♻ ☆ Spectral Condition for $μ$P under Width-Depth Scaling
Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $μ$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $μ$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $μ$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.
comment: 76 pages, 13 figures, 40 tables
♻ ☆ Scaling Categorical Flow Maps
Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
comment: Minor style changes
♻ ☆ Toward a Unified Lyapunov-Certified ODE Convergence Analysis of Smooth Q-Learning with p-Norms
Convergence of Q-learning has been the subject of extensive study for decades. Among the available techniques, the ordinary differential equation (ODE) method is particularly appealing as a general-purpose, off-the-shelf tool for sanity-checking the convergence of a wide range of reinforcement learning algorithms. In this paper, we develop a unified ODE-based convergence framework that applies to standard Q-learning and several soft/smoothed variants, including those built on the log-sum-exponential softmax, Boltzmann softmax, and mellowmax operators. Our analysis uses a smooth p-norm Lyapunov function, leading to concise yet rigorous stability arguments and circumventing the non-smoothness issues inherent to classical infty-norm-based approaches. To the best of our knowledge, the proposed framework is among the first to provide a unified ODE-based treatment that is broadly applicable to smooth Q-learning algorithms while also encompassing standard Q-learning. Moreover, it remains valid even in settings where the associated Bellman operator is not a contraction, as may happen in Boltzmann soft Q-learning.
♻ ☆ Why is prompting hard? Understanding prompts on binary sequence predictors
Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. On numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.
♻ ☆ ERIS: Enhancing Privacy and Scalability in Federated Learning via Federated Shard Aggregation
Scaling Federated Learning (FL) to billion-parameter models forces a challenging trade-off between privacy, scalability, and model utility. Existing solutions often tackle these challenges in isolation, sacrificing accuracy, relying on costly cryptographic tools, or introducing communication and optimization inefficiencies that affect convergence. We introduce ERIS, an FL framework centered on Federated Shard Aggregation (FSA), a novel mechanism that partitions each client update into non-overlapping shards whose aggregation is distributed across multiple client-side aggregators. FSA removes the central aggregation bottleneck, limits the information visible to any single observer, and preserves the centralized FL update after reassembly. ERIS can further readily integrate Distributed Shifted Compression (DSC) to reduce transmitted payloads and exposed coordinates. We prove that ERIS preserves convergence under standard assumptions and bounds mutual information leakage by the observable fraction of each update, decreasing with the number of client-side aggregators, and with the compression level when DSC is enabled. Experiments across image and text tasks, including large language models, show that ERIS achieves FedAvg-level utility while substantially reducing communication bottlenecks and improving robustness to membership inference and reconstruction attacks, without relying on heavy cryptography or utility-degrading perturbations.
♻ ☆ StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48\% improvement in accuracy and up to 20--100X faster inference than diffusion-based methods.
♻ ☆ CARL: Criticality-Aware Agentic Reinforcement Learning
Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each step holds equal contribution, which deviates significantly from reality. Our analysis reveals that only the action choices on a small fraction of states are critical in determining the final outcome. Building on this insight, we propose CARL, a criticality-aware reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for state criticality and achieves focused training by assigning rewards to actions taken from high-criticality states while excluding actions taken from low-criticality states from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
comment: 18 pages, 6 figures
♻ ☆ Provable Anytime Ensemble Sampling Algorithms in Nonlinear Contextual Bandits
We provide a unified algorithmic framework for ensemble sampling in nonlinear contextual bandits and develop corresponding regret bounds for two most common nonlinear contextual bandit settings: Generalized Linear Ensemble Sampling (GLM-ES) for generalized linear bandits and Neural Ensemble Sampling (Neural-ES) for neural contextual bandits. Both methods maintain multiple estimators for the reward model parameters via maximum likelihood estimation on randomly perturbed data. We prove high-probability frequentist regret bounds of $\widetilde{O}(d^{3/2} \sqrt{T} + d^{4})$ for GLM-ES and $\widetilde{O}(\widetilde{d}^{3/2} \sqrt{T})$ for Neural-ES, where $d$ is the dimension of feature vectors, $\widetilde{d}$ is the effective dimension of a neural tangent kernel (NTK) matrix and $T$ is the number of rounds. The regret bound of GLM-ES matches the state-of-the-art result of randomized exploration algorithms in generalized linear bandit setting. In the theoretical analysis, we introduce techniques that address challenges specific to nonlinear models. Practically, we remove fixed-time horizon assumption by developing anytime versions of our algorithms, suitable when $T$ is unknown. Finally, we empirically evaluate GLM-ES, Neural-ES and their anytime variants, demonstrating strong performance. Overall, our results establish ensemble sampling as a provable and practical randomized exploration approach for nonlinear contextual bandits.
comment: 58 pages, 5 figures, 1 table
♻ ☆ MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics
While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like \LaTeX{} or Maple. To address this, we introduce MathlibLemma, a modular LLM-based pipeline for automated folklore-lemma mining: the discovery, formalization, and proving of reusable intermediate facts that mathematicians often take for granted but that are not always present in formal libraries. At its core, MathlibLemma proactively mines the missing connective tissue of mathematics. The pipeline produces a verified library of folklore-style lemmas, including 1,506 Lean-checked proofs that pass a proof-bypass screen; a small curated pilot subset has also been merged into Mathlib, providing external evidence that selected outputs can meet expert library standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 non-trivial type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work takes a step toward AI-assisted expansion of formal mathematical libraries.
♻ ☆ Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/AIPS-UofT/Stargazer and https://aips-uoft.github.io/Stargazer/, respectively.
♻ ☆ Cardiac Stability Theory: An Axiomatically Grounded Framework for Continuous Cardiac Health Monitoring via Smartphone Photoplethysmography
We present Cardiac Stability Theory (CST), an axiomatically grounded framework formally defining cardiovascular health as a stability margin around a cardiac dynamical attractor. From four axioms we derive the Cardiac Stability Index (CSI), a composite scalar in [0,1] integrating the largest Lyapunov exponent, recurrence determinism, and signal entropy via time-delay embedding. The ECG-based model (CSISurrogateV2, CNN-Transformer) achieves $R^2=0.8788$, MAE$=0.0234$ on PTB-XL (21,799 recordings). We extend CSI to smartphone PPG via Complementary Domain Transfer (CDT): CSISurrogateV2 generates pseudo-labels for the BUT PPG dataset (48 recordings, 12 subjects), training TinyCSINet (122,849 parameters), achieving MAE$=0.0557$, $ρ=0.660$ on the held-out test set ($n=1065$ windows) at ${<}30$ ms mobile latency. CDT is validated on BIDMC, Welltory, and RWS-PPG. Paired validation on 5,035 BIDMC windows yields $r=0.454$ ($ρ=0.485$, $p<10^{-295}$), confirming correlated cardiac stability across modalities. CSI is negatively correlated with age (slope $= -0.000225$ CSI/year, PTB-XL), discriminates atrial fibrillation from normal sinus rhythm (AUROC$=0.89$), and is robust under Perturbation Invariance Training (max AUC drop 1.65\%). We derive HeartSpan, a longitudinal stability metric relative to population age norms, enabling continuous non-invasive cardiac monitoring from commodity smartphones for longevity tracking and cardiac risk stratification.
♻ ☆ $(α,β)$-Stability for Boosting Vector-Valued Prediction
Despite the widespread use of boosting in structured prediction, a general theoretical understanding of aggregation beyond scalar prediction remains incomplete. We study vector-valued prediction under a target divergence and identify a geometric stability property under which aggregation amplifies weak guarantees into strong ones. We formalize this property as $(α,β)$-stability by geometric median and show how it supports a boosting framework based on exponential reweighting and geometric-median aggregation. For vector-valued prediction, we characterize this stability property under several natural divergences: $\ell_1$ and $\ell_2$ distances for unconstrained vector-valued prediction, and TV, Hellinger, and KL for density estimation over finite probability vectors. Building on these results, we propose a generic boosting framework \geomedboost. Under a weak learner condition and $(α,β)$-stability, we obtain exponential decay of the empirical divergence error, which then yields population guarantees through a generalization bound.
♻ ☆ CAP: Controllable Alignment Prompting for Unlearning in LLMs ACL 2026
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
comment: Accpeted to ACL 2026 Main Conference
♻ ☆ Diversity in Large Language Models under Supervised Fine-Tuning
Supervised Fine-Tuning (SFT) is essential for aligning Large Language Models (LLMs) with user intent, yet it is believed to suppress generative diversity. Although this reduction is frequently referenced, formal empirical testing of the phenomenon remains limited. The expressiveness of LLMs by itself was addressed by multiple prior methods. Their varying perspectives suggest that deeper investigation could yield further improvements. In this study, we attribute the decline to two primary drivers: the neglect of low-frequency patterns within fine-tuning datasets and the forgetting of preexisting knowledge. Motivated by our theoretical analysis, we develop Tempered Focal (TOFU) loss, a novel objective that addresses both stated challenges simultaneously. Our extensive evaluation confirms at scale that generation breadth narrows after SFT and strengthens the hypothesis explaining this effect. Across multiple models and benchmarks, we demonstrate that TOFU enhances output diversity while preserving high response quality, offering a principled approach to SFT.
♻ ☆ Root Cause Analysis of Measurement and Mechanistic Anomalies
Root cause analysis of anomalies aims to identify how and why a sample deviates from the normal process. Existing methods primarily focus on telling which features are responsible, ignoring that anomalies can arise through two fundamentally different processes: measurement errors, where the sample is generated normally but one or more values is recorded incorrectly, and mechanism shifts, where the causal process that generated the sample was changed. While measurement errors can often be safely corrected, mechanistic anomalies require careful consideration. In this paper, we formally define a causal model that explicitly captures both types by treating outliers as latent interventions on latent ("true") and observed ("measured") variables and show under which conditions the distinction is possible. Based on this model, we develop an efficient inference procedure for localizing root causes and distinguishing anomaly types. Experiments on synthetic and real-world data show that our method provides state-of-the-art and highly robust performance in both root cause localization and classification of anomaly types.
♻ ☆ Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP) - based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via traditional federated averaging technique on the same. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
comment: Accepted to TMLR 2026
♻ ☆ An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification
As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
comment: 40 pages, 14 figures, 16 tables. Accepted as a conference paper at AHLI Conference on Health, Inference, and Learning (CHIL) 2026
♻ ☆ Learning Confidence Ellipsoids and Applications to Robust Subspace Recovery
We study the problem of finding confidence ellipsoids for an arbitrary distribution in high dimensions. Given samples from a distribution $D$ and a confidence parameter $α$, the goal is to find the smallest volume ellipsoid $E$ which has probability mass $\mathbb{P}_{D}[E] \ge 1-α$. Ellipsoids are a highly expressive class of confidence sets as they can capture correlations in the distribution, and can approximate any convex set. In statistics, this is the classic minimum volume estimator introduced by Rousseeuw as a robust non-parametric estimator of location and scatter. However in high dimensions, it becomes NP-hard to obtain any non-trivial approximation factor in volume when the condition number $β$ of the ellipsoid (ratio of the largest to the smallest axis length) goes to $\infty$. This motivates the focus of our paper: can we efficiently find confidence ellipsoids with volume approximation guarantees when compared to ellipsoids of bounded condition number $β$? Our main result is a polynomial time algorithm that finds an ellipsoid $E$ whose volume is within a $O(β)^{γd}$ multiplicative factor of the volume of best $β$-conditioned ellipsoid while covering at least $1-O(α/γ)$ probability mass for any $γ\in (0,1)$. In particular, setting $γ= o(1)$, this gives a $O(β)^{o(d)}$ volume approximation, with a multiplicative loss in miscoverage. We complement this with a computational hardness result that shows that such a dependence on $β$ seems necessary, even with some slack in coverage. The algorithm and analysis uses the rich primal-dual structure of the minimum volume enclosing ellipsoid and the geometric Brascamp-Lieb inequality. As a consequence, we obtain the first polynomial time algorithm with approximation guarantees on worst-case instances of the robust subspace recovery problem.
♻ ☆ Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI
Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML
♻ ☆ The Sample Complexity of Uniform Approximation for Multi-Dimensional CDFs and Fixed-Price Mechanisms
We study the sample complexity of learning a uniform approximation of an $n$-dimensional cumulative distribution function (CDF) within an error $ε> 0$, when observations are restricted to a minimal one-bit feedback. This serves as a counterpart to the multivariate DKW inequality under ''full feedback'', extending it to the setting of ''bandit feedback''. Our main result shows a near-dimensional-invariance in the sample complexity: we get a uniform $ε$-approximation with a sample complexity $\frac{1}{ε^3}{\log\left(\frac 1 ε\right)^{\mathcal{O}(n)}}$ over a arbitrary fine grid, where the dimensionality $n$ only affects logarithmic terms. As direct corollaries, we provide tight sample complexity bounds and novel regret guarantees for learning fixed-price mechanisms in small markets, such as bilateral trade settings.
♻ ☆ How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients ACL2026
As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
comment: ACL2026, camera-ready
♻ ☆ Targeted Synthetic Control Method
The synthetic control method (SCM) estimates causal effects in panel data with a single-treated unit by constructing a counterfactual outcome as a weighted combination of untreated control units that matches the pre-treatment trajectory. In this paper, we introduce the targeted synthetic control (TSC) method, a new two-stage estimator that directly estimates the counterfactual outcome. Specifically, our TSC method (1) yields a targeted debiasing estimator, in the sense that the targeted updating refines the initial weights to produce more stable weights; and (2) ensures that the final counterfactual estimation is a convex combination of observed control outcomes to enable direct interpretation of the synthetic control weights. TSC is flexible and can be instantiated with arbitrary machine learning models. Methodologically, TSC starts from an initial set of synthetic-control weights via a one-dimensional targeted update through the weight-tilting submodel, which calibrates the weights to reduce bias of weights estimation arising from pre-treatment fit. Furthermore, TSC avoids key shortcomings of existing methods (e.g., the augmented SCM), which can produce unbounded counterfactual estimates. Across extensive synthetic and real-world experiments, TSC consistently improves estimation accuracy over state-of-the-art SCM baselines.
♻ ☆ Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further propose a novel model-agnostic meta-learner, called SurvB-learner, to estimate the bounds that can be used in combination with arbitrary machine-learning models, and that has favorable theoretical properties such as double-robustness and quasi-oracle efficiency. We finally demonstrate the effectiveness of our meta-learner across various experiments using both simulated and real-world data.
♻ ☆ Causal Discovery for Irregularly Time Series with Consistency Guarantees
This paper studies causal discovery in irregularly sampled time series-a key challenge in risk-sensitive domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. The main challenge comes from the interdependence between missing data imputation and causal structure recovery: errors in imputation and structure learning can reinforce each other, leading to an inaccurate causal graph. Existing methods either impute first and then discover, or jointly optimize both via neural representation learning, but lack explicit mechanisms to ensure mutual consistency of imputation and structure learning. We address this challenge with ReTimeCausal, an EM-based framework that alternates between imputation and structure learning, which encourages structural consistency throughout the optimization process. Our framework provides theoretical consistency guarantees for structure recovery and extends classical results to settings with irregular sampling and high missingness. ReTimeCausal combines kernel-based sparse regression and structural constraints in an alternating process that updates the completed data and the causal graph in turn. Experiments on synthetic and real-world datasets show that ReTimeCausal is more effective than existing methods under challenging irregular sampling and missing data.
comment: 12 pages, 2 figures
♻ ☆ Learning to Learn the Macroscopic Fundamental Diagram using Physics-Informed and meta Machine Learning techniques
The Macroscopic Fundamental Diagram is a popular tool used to describe traffic dynamics in an aggregated way, with applications ranging from traffic control to incident analysis. However, estimating the MFD for a given network requires large numbers of loop detectors, which is not always available in practise. This article proposes a framework to alleviate the data scarcity challenge harnessing Meta-Learning, a subcategory of Machine Learning that trains models to understand and adapt to new tasks on their own. We use Meta-Learning to identify and exploit transferable patterns from data-rich cities to cities where not enough data is available to estimate the MFD. The developed model is trained and tested by leveraging data from multiple cities and exploiting it to model the MFD of other cities with different shares of detectors and topological structures. The proposed Meta-Learning framework is applied to an ad-hoc Multi-Task Physics-Informed Neural Network, specifically designed to estimate the MFD. Results show an average MAE improvement in flow prediction of around 50% across cities (depending on the subset of loop detectors tested). The Meta-Learning framework thus successfully generalises across diverse urban settings and improves performance on cities with limited data, demonstrating the potential of using Meta-Learning when a limited number of detectors is available. We directly test this assumption by applying the Meta-Learning outputs to unseen cities to simulate a real-life application scenario and the wide applicability of the proposed methodology. Finally, the proposed framework is validated against traditional Transfer Learning approaches and tested with FitFun, a model for FD estimation from the literature, to prove its transferability.
comment: Version accepted for publication in Transportation Research Part C (before proof-reading)
♻ ☆ Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias ECML
As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
comment: Accepted at ECML PKDD Bias Workshop '25
♻ ☆ Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning
A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.
comment: 17 pages, 5 figures
♻ ☆ Uncertainty Estimation for Heterophilic Graphs Through the Lens of Information Theory
While uncertainty estimation for graphs recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog to data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can increase with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution - different from homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.
♻ ☆ Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation, risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as ``intelligence'', despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits, lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometrics tests, or could be created entirely from scratch to fit the unique context of AI.
♻ ☆ Supervised Guidance Training for Infinite-Dimensional Diffusion Models
Score-based diffusion models have recently been extended to infinite-dimensional function spaces, with uses such as inverse problems arising from partial differential equations. In the Bayesian formulation of inverse problems, the aim is to sample from a posterior distribution over functions obtained by conditioning a prior on noisy observations. While diffusion models provide expressive priors in function space, the theory of conditioning them to sample from the posterior remains open. We address this, assuming that either the prior lies in the Cameron-Martin space, or is absolutely continuous with respect to a Gaussian measure. We prove that the models can be conditioned using an infinite-dimensional extension of Doob's $h$-transform, and that the conditional score decomposes into an unconditional score and a guidance term. As the guidance term is intractable, we propose a simulation-free score matching objective (called Supervised Guidance Training) enabling efficient and stable posterior sampling. We illustrate the theory with numerical examples on Bayesian inverse problems in function spaces. In summary, our work offers the first function-space method for fine-tuning trained diffusion models to accurately sample from a posterior.
♻ ☆ Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration ICML 2026
Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.
comment: This study has been Accepted by ICML 2026. The current version is a manuscript, please refer to the official version released at ICML 2026 for the final published version
♻ ☆ Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments ICML 2026
This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
comment: ICML 2026
♻ ☆ The Invisible Handshake: Persistent Overpricing by Adaptive Market Agents
We study overpricing in a repeated game between two representative agents: a market maker, who controls market liquidity, and a market taker, who chooses trade quantities. Market prices evolve through the endogenous price impact of trades and exogenous shocks. We define overpricing relative to a counterfactual price path that holds fixed the same sequence of shocks while shutting down price impact, and characterize the set of feasible strategy profiles that generate persistent overpricing while respecting cash and inventory constraints. We provide a sufficient condition for decentralized learning to reach the overpricing region in finite time, and we show that this condition is satisfied, in particular, by projected stochastic gradient ascent. A key step in the analysis is a decomposition of the game into a competitive component, which favors zero price impact, and a collaborative component, which makes overpricing jointly profitable when aggregate inventory is positive. We further show that the same structural incentives govern both myopic and farsighted objectives. Together, these results show how decentralized learning by adaptive market agents can lead to persistent overpricing in financial markets.
♻ ☆ Apple: Toward General Active Perception via Reinforcement Learning ICLR 2026
Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment. For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) - a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems. We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics. Project page: https://timschneider42.github.io/apple
comment: 27 pages; 21 figures; accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)
♻ ☆ Schoenfeld's Anatomy of Mathematical Reasoning by Language Models ACL2026
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
comment: ACL2026, camera-ready
♻ ☆ HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
comment: Code & Data: https://github.com/DeepExperience/HyperEyes
♻ ☆ MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier ICML 2026
While large language models (LLMs) show promise in scientific discovery, existing research focuses on inference or feedback-driven training, leaving the direct modeling of the generative reasoning process, $P(\text{hypothesis}|\text{background})$ ($P(h|b)$), unexplored. We demonstrate that directly training $P(h|b)$ is mathematically intractable due to the combinatorial complexity ($O(N^k)$) inherent in retrieving and composing inspirations from a vast knowledge base. To break this barrier, we introduce MOOSE-Star, a unified framework that enables tractable and scalable training of $P(h|b)$, while supporting more scalable inference. In the best case, MOOSE-Star reduces complexity from exponential to logarithmic ($O(\log N)$) by (1) training on decomposed subtasks derived from the probabilistic equation of discovery, (2) employing motivation-guided hierarchical search to enable logarithmic retrieval and prune irrelevant subspaces, and (3) utilizing bounded composition for robustness against retrieval noise. To facilitate this, we release TOMATO-Star, a dataset of 108,717 decomposed papers (38,400 GPU hours) for training. Empirically, MOOSE-Star scales continuously with training data and inference budget, whereas direct brute-force sampling hits a complexity wall.
comment: Accepted by ICML 2026
♻ ☆ Black-Box Detection of LLM-Generated Text Using Generalized Jensen-Shannon Divergence ICML 2026
We study black-box detection of machine-generated text under practical constraints: the scoring model (proxy LM) may mismatch the unknown source model, and per-input contrastive generation is costly. We propose SurpMark, a reference-based detector that summarizes a passage by the dynamics of its token surprisals. SurpMark discretizes surprisals into interpretable states, estimates a state-transition matrix for the test text, and scores it via a generalized Jensen-Shannon (GJS) gap between the test transitions and two fixed references (human vs. machine) built once from existing corpora. Theoretically, we derive design guidance for how the discretization bins should scale with data and provide a principled justification for our test statistic. Empirically, across multiple datasets, source models, and scenarios, SurpMark consistently matches or surpasses baselines, demonstrating strong robustness across domains and generators; our experiments on hyperparameter sensitivity exhibit trends that our theoretical results help to explain.
comment: ICML 2026
♻ ☆ Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations
We study the training dynamics of neural classifiers through the lens of binary hypothesis testing. We re-formalize classification as a collection of binary tests between class-conditional distributions induced by learned representations and show empirically that, along training trajectories, well-generalizing networks progressively approach Neyman-Pearson optimal decision rules, as measured by monotonic growth in the KL divergence retained by learned representations. We provide sufficient conditions for exact optimality, discuss its implications for training regularization, and define an informational plane, (so-called Evidence-Error plane) where convergence can be assessed methodically across network architecture.
♻ ☆ Deterministic Differentiable Structured Pruning for Large Language Models ICML26
Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0 norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train--test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete l0 objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train--test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
comment: Published at ICML26;
♻ ☆ OpenClaw-RL: Train Any Agent Simply by Talking
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
comment: Code: https://github.com/Gen-Verse/OpenClaw-RL
♻ ☆ SymTorch: Symbolic Distillation of Neural Networks
What mathematical functions do neural network components learn? Symbolic distillation addresses this question by expressing neural network components with interpretable, closed-form mathematical expressions that expose the functional structure learned during training. We develop symbolic distillation as a systematic, architecture-agnostic methodology, and release our approach as the open-source SymTorch package - a PySR-powered library built natively for the PyTorch ecosystem. Applying this methodology across diverse architectures, we find that SymTorch is successful in the automated discovery of physical laws. Specifically, our approach (1) recovers pairwise interaction forces from graph neural networks trained on empirical $n$-body observations, (2) distills the exact closed-form PDE/ODE solutions of multiple physical systems, including the value of constants, from physics-informed neural networks trained on sparse data, and (3) uncovers the chaotic dynamics of the Lorenz system from high-dimensional data, ultimately outperforming the base neural network on downstream prediction tasks. We further demonstrate the utility of our framework for model interpretability by providing an optimized implementation of SLIME - a symbolic extension to the LIME explainability method. SLIME consistently outperforms LIME across predictive metrics across eight popular classification and regression benchmarks, while still providing an interpretable local symbolic model. Lastly, we investigate replacing transformer MLP layers with symbolic surrogates: replacing 1-7 layers with symbolic approximations yields 2-19\% throughput improvements and up to 18.7\% VRAM reduction, with the resulting hybrid models lying on the Pareto front of throughput versus perplexity among open-source LLMs of comparable scale.
♻ ☆ What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA
What structural inductive bias helps transformers reason over knowledge graphs? Through controlled ablations of a minimal transformer modification with four independently removable components (sparse adjacency masking, edge-type biases, query scaling, value gating), we isolate which structural signals drive multi-hop reasoning. Our finding is sharp: sparse adjacency masking alone accounts for the dominant share of improvement over unmasked transformers (+72.5pp on 3-hop MetaQA, +45.5pp on WebQSP, +53.9pp on CWQ), while learned relation parameters add only modest refinement and can actively hurt without structural guidance. A zero-shot experiment provides architecturally independent corroboration: masking-based attention degrades 4.0x less than relation-specific weights when edge types are held out. The useful inductive bias for multi-hop KGQA is predominantly topological, not relational.
comment: 8 pages + appendix, 9 figures
♻ ☆ DUALFloodGNN: Physics-informed Graph Neural Network for Operational Flood Modeling IJCAI
Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given its flexible input and architecture, GNNs can be leveraged alongside physics-informed techniques with ease, significantly improving interpretability and generalizability. We introduce a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables (e.g., water volume, flow, and depth) while maintaining high computational efficiency. The model is open sourced at https://github.com/acostacos/dual_flood_gnn. The dataset is open sourced at https://hdl.handle.net/2123/35293 with the DOI 10.25910/9xav-0s86.
comment: Accepted for publication at the IJCAI-ECAI 2026 AI4Tech track
♻ ☆ Task complexity shapes internal representations and robustness in neural networks
Neural networks excel across a wide range of tasks, yet remain black boxes. In particular, how their internal representations are shaped by the complexity of the input data and the problems they solve remains obscure. In this work, we introduce a suite of five data-agnostic probes-pruning, binarization, noise injection, sign flipping, and bipartite network randomization-to quantify how task difficulty influences the topology and robustness of representations in multilayer perceptrons (MLPs). MLPs are represented as signed, weighted bipartite graphs from a network science perspective. We contrast easy and hard classification tasks on the MNIST and Fashion-MNIST datasets. We show that binarizing weights in hard-task models collapses accuracy to chance, whereas easy-task models remain robust. We also find that pruning low-magnitude edges in binarized hard-task models reveals a sharp phase-transition in performance. Moreover, moderate noise injection can enhance accuracy, resembling a stochastic-resonance effect linked to optimal sign flips of small-magnitude weights. Finally, preserving only the sign structure-instead of precise weight magnitudes-through bipartite network randomizations suffices to maintain high accuracy. These phenomena define a model- and modality-agnostic measure of task complexity: the performance gap between full-precision and binarized or shuffled neural network performance. Our findings highlight the crucial role of signed bipartite topology in learned representations and suggest practical strategies for model compression and interpretability that align with task complexity.
♻ ☆ Relational reasoning and inductive bias in transformers and large language models
Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning remain poorly understood. We investigate how transformers perform \textit{transitive inference}, a classic relational reasoning behavior from psychology which elicits inference about indirectly related items (e.g., if $A > B$ and $B > C$, then $A > C$). We compare in-weights learning (IWL) and in-context learning (ICL) behaviors and mechanisms on these tasks, and fine profoundly different patterns of generalization. IWL models learn a linear embedding, which leads to transitive inference as well as other behavioral effects present in humans and animals. ICL models, in contrast, are capable of learning to generalize transitively, but only do so when it is necessitated by the training data, otherwise learning a match-and-copy strategy. Interestingly, pre-training ICL models on in-context linear regression tasks that provide them with a latent linear representation is sufficient to make the ICL behaviors and internal representations qualitatively and quantitatively more like IWL. In order to test whether the same inference patterns are present across in large language models, we leverage a congruency paradigm which allows us to differentially probe IWL and ICL generalization patterns without access to their training data. We indeed see IWL reasoning leads to more transitive generalization than ICL. Moreover, we find that prompting the ICL models to use a linear mental map led to increased transitive inference over different geometric prompts. Together, these results reveal that both the training regime and the geometric structure of induced representations critically determine transformers capacity for transitive inference.
comment: 15 pages, 10 figures
♻ ☆ Dynamics-Aligned Shared Hypernetworks for Contextual RL under Discontinuous Shifts
Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode arises when latent context discontinuously changes how actions affect the environment, requiring incompatible control responses across contexts. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to discontinuous context-to-dynamics shifts, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via expressivity separation results for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under non-overlapping contexts. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate challenging context-to-dynamics interactions, including actuator inversion, actuator permutations, and weakly non-overlapping continuous dynamics. On AIB's held-out tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 58.1% and surpassing a standard context-aware baseline by 11.5% on average.
♻ ☆ SLR: Automated Synthesis for Scalable Logical Reasoning
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
♻ ☆ Toward General and Robust LLM-enhanced Text-attributed Graph Learning
Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from texts and edge sparsity, leading to suboptimal performance. To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified comprehensive and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12\% and 17.47\% in ideal and sparse settings, respectively. Moreover, as the data sparsity ratio increases, the performance improvement of UltraTAG-S also rises, which underscores the effectiveness and robustness of UltraTAG-S.
♻ ☆ Context Learning for Multi-Agent Discussion
Multi-Agent Discussion (MAD) has garnered increasing attention very recently, where multiple LLM instances collaboratively solve problems via structured discussion. However, we find that current MAD methods easily suffer from discussion inconsistency, LLMs fail to reach a coherent solution, due to the misalignment between their individual contexts.In this paper, we introduce a multi-LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL train the generators to control context coherence and output discrepancies via a carefully crafted self-adaptive mechanism.It enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that the performance of M2CL significantly surpasses existing methods by 20%--50%, while enjoying favorable transferability and computational efficiency.
♻ ☆ A Physical Theory of Backpropagation: Exact Gradients from the Least-Action Principle
Backpropagation is typically presented as a symbolic procedure: a backward pass topologically distinct from inference, with non-local error signals and synchronous global clocking, features with no clear analog in physical reality. Existing physics-inspired alternatives recover gradients only approximately, in vanishing-perturbation limits, or under weight-symmetry constraints incompatible with feedforward architectures. In this paper, we address this gap by deriving exact backpropagation from Hamilton's least-action principle. By recasting the forward dynamics in continuous time and adapting a Lagrangian formalism for non-conservative systems to the resulting flow, we unify inference and gradient computation within a single variational framework on a doubled phase space, whose two conjugate fields jointly encode activations and sensitivities. A single global Lagrangian governs the dynamics: the task loss enters as a symmetry-breaking perturbation of the forward manifold, and credit assignment emerges as the tension that develops between the conjugate states. Inference and gradient computation thus unfold simultaneously through local interactions, requiring no separate backward circuit. Ultimately, standard backpropagation is recovered exactly as the discrete-time projection of this continuous flow. This perspective unifies the formalism of physics with backpropagation, opening a principled pathway for applying tools from classical mechanics - symplectic geometry, Noether's theorem, path-integral methods - to the analysis of learning dynamics. As a downstream consequence, it also points toward analog and neuromorphic substrates in which learning is embodied in the hardware itself.
comment: 22 pages
♻ ☆ Agent-Omit: Adaptive Context Omission for Efficient LLM Agents ICML 2026
Managing agent context (e.g., thought and observation) during multi-turn agent-environment interactions is an emerging strategy to improve agent efficiency. However, existing studies treat the entire interaction trajectories equally, overlooking the thought necessity and observation utility varies across turns. To this end, we first conduct quantitative investigations into how thought and observation affect agent effectiveness and efficiency. Based on our findings, we propose Agent-Omit, a unified training framework that empowers LLM agents to adaptively omit redundant thoughts and observations. Specifically, we first synthesize a small amount of cold-start data, including both single-turn and multi-turn omission scenarios, to fine-tune the agent for omission behaviors. Furthermore, we introduce an omit-aware agentic reinforcement learning approach, incorporating a dual sampling mechanism and a tailored omission reward to incentivize the agent's adaptive omission capability. Theoretically, we prove that the deviation of our omission policy is upper-bounded by KL-divergence. Experimental results on five agent benchmarks show that our constructed Agent-Omit-8B could obtain performance comparable to seven frontier LLM agent, and achieve the best effectiveness-efficiency trade-off than seven efficient LLM agents methods. Our code and data are available at https://github.com/usail-hkust/Agent-Omit.
comment: ICML 2026
♻ ☆ Layer Collapse in Diffusion Language Models
Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.
comment: 9 Pages, Preprint
♻ ☆ Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs IEEE
This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It comes from error spectrum shaping techniques from state-space digital filters, and therefore establishes connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.
comment: Accepted by IEEE TSP
♻ ☆ Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
♻ ☆ GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling
In this work we introduce GreenLightningAI (GLAI), a new architectural block designed as an alternative to conventional MLPs. The central idea is to separate two types of knowledge that are usually entangled during training: (i) *structural knowledge*, encoded by the stable activation patterns induced by ReLU activations; and (ii) *quantitative knowledge*, carried by the numerical weights and biases. By fixing the structure once stabilized, GLAI reformulates the MLP as a combination of paths, where only the quantitative component is optimized. This reformulation retains the universal approximation capabilities of MLPs, yet achieves a more efficient training process, reducing training time by ~40% on average across the cases examined in this study. Crucially, GLAI is not just another classifier, but a generic block that can replace MLPs wherever they are used, from supervised heads with frozen backbones to projection layers in self-supervised learning or few-shot classifiers. Across diverse experimental setups, GLAI consistently matches or exceeds the accuracy of MLPs with an equivalent number of parameters, while converging faster. Overall, GLAI establishes a new design principle that opens a direction for future integration into large-scale architectures such as Transformers, where MLP blocks dominate the computational footprint.
comment: 20 pages, 2 figures
♻ ☆ ActivationReasoning: Logical Reasoning in Latent Activation Spaces ICLR 2026
Large language models (LLMs) excel at generating fluent text, but their internal reasoning remains opaque and difficult to control. Sparse autoencoders (SAEs) make hidden activations more interpretable by exposing latent features that often align with human concepts. Yet, these features are fragile and passive, offering no mechanism for systematic reasoning or model control. To address this, we introduce ActivationReasoning (AR), a framework that embeds explicit logical reasoning into the latent space of LLMs. It proceeds in three stages: (1) Finding latent representations, first latent concept representations are identified (e.g., via SAEs) and organized into a dictionary; (2) Activating propositions, at inference time AR detects activating concepts and maps them to logical propositions; and (3)Logical reasoning, applying logical rules over these propositions to infer higher-order structures, compose new concepts, and steer model behavior. We evaluate AR on multi-hop reasoning (PrOntoQA), abstraction and robustness to indirect concept cues (Rail2Country), reasoning over natural and diverse language (ProverQA), and context-sensitive safety (BeaverTails). Across all tasks, AR scales robustly with reasoning complexity, generalizes to abstract and context-sensitive tasks, and transfers across model backbones. These results demonstrate that grounding logical structure in latent activations not only improves transparency but also enables structured reasoning, reliable control, and alignment with desired behaviors, providing a path toward more reliable and auditable AI.
comment: Proceedings of the 14th International Conference on Learning Representations (ICLR 2026)
♻ ☆ Robust Remote Reinforcement Learning over Unreliable Communication Channels using Homomorphic State Encoding
Traditional Reinforcement Learning (RL) frameworks generally assume that the agent perceives the state of the underlying Markov process instantaneously and then takes actions accordingly. If the agent cannot directly observe the process, but rather receives state updates from a remote sensor over a lossy and/or delayed channel, it may be forced to operate with partial and intermittent information. In recent years, numerous learning architectures have been proposed to manage RL with imperfect or remote feedback; however, they offer solutions tailored to specific use cases, often with a substantial computational and communication burden. To address these limitations, we propose a novel learning architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the distributed training of RL agents over unreliable communication channels without the need to exchange gradient information. Our experimental results demonstrate that HR3L significantly outperforms the state-of-the-art methods in terms of sample efficiency, leading to faster training and reduced communication overhead. In addition, we show that HR3L can adapt to different scenarios, including packet loss, delayed transmissions, and bandwidth limitations, without experiencing significant performance degradation.
comment: This manuscript is currently under revision
♻ ☆ Mitigating Membership Inference in Intermediate Representations with Differentially Private Training
In Embedding-as-an-Interface (EaaI) settings, pre-trained models are queried for Intermediate Representations (IRs). The distributional properties of IRs can leak training-set membership signals, enabling Membership Inference Attacks (MIAs) whose strength varies across layers. Although Differentially Private Stochastic Gradient Descent (DP-SGD) mitigates such leakage, existing implementations employ per-example gradient clipping and a uniform, layer-agnostic noise multiplier, ignoring heterogeneous layer-wise MIA vulnerability. This paper introduces Layer-wise MIA-risk-aware DP-SGD (LM-DP-SGD), which adaptively allocates privacy protection across layers in proportion to their MIA risk. Specifically, LM-DP-SGD trains a shadow model on a public shadow dataset, extracts per-layer IRs from its train/test splits, and fits layer-specific MIA adversaries, using their attack error rates as MIA-risk estimates. Leveraging the cross-dataset transferability of MIAs, these estimates are then used to reweight each layer's contribution to the globally clipped gradient during private training, providing layer-appropriate protection under a fixed noise magnitude. We further establish theoretical guarantees on both privacy and convergence of LM-DP-SGD. Extensive experiments show that, under the same privacy budget, LM-DP-SGD reduces the peak IR-level MIA risk while preserving utility, yielding a superior privacy-utility trade-off.
♻ ☆ TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers with practically optimal convergence in LLM training, 2) OsciReset, the first effective algorithm to suppress LLMs' weight oscillation bottleneck, and 3) OutControl, a mix-precision algorithm to retain outlier accuracy. TetraJet-v2 outperforms prior methods on FP4 pre-training for LLMs across models up to 370M parameters trained up to 212B tokens, reducing the performance gap to BF16 by an average of 51.3% while enabling an 1.67x end-to-end speedup over FP8. The code is available at https://github.com/thu-ml/TetraJet-v2-NVFP4Training.
♻ ☆ Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control
State-of-the-art deep reinforcement learning (RL) methods have achieved remarkable performance in continuous control tasks, yet their computational complexity is often incompatible with the constraints of resource-limited hardware, due to their reliance on replay buffers, batch updates, and target networks. The emerging paradigm of streaming deep RL addresses this limitation through purely online updates, achieving strong empirical performance on standard benchmarks. In this work, we propose two novel streaming deep RL algorithms, Streaming Soft Actor-Critic (S2AC) and Streaming Deterministic Actor-Critic (SDAC), explicitly designed to be compatible with state-of-the-art batch RL methods, making them particularly suitable for on-device finetuning applications such as Sim2Real transfer. Both algorithms achieve performance comparable to state-of-the-art streaming baselines on standard benchmarks without requiring tedious per-environment hyperparameter tuning. We further investigate the batch-to-streaming transition, showing that a naive transition does not guarantee preservation of pre-trained policy performance, and propose a principled approach to address this challenge.
♻ ☆ Global Optimization via Softmin Energy Minimization
Global optimization, particularly for non-convex functions with multiple local minima, poses significant challenges for traditional gradient-based methods. While metaheuristic approaches offer empirical effectiveness, they often lack theoretical convergence guarantees and may disregard available gradient information. This paper introduces a novel gradient-based swarm particle optimization method designed to efficiently escape local minima and locate global optima. Our approach leverages a "Soft-min Energy" interacting function, $J_β(\mathbf{x})$, which provides a smooth, differentiable approximation of the minimum function value within a particle swarm. We define a stochastic gradient flow in the particle space, incorporating a Brownian motion term for exploration and a time-dependent parameter $β$ to control smoothness, similar to temperature annealing. We theoretically demonstrate that for strongly convex functions, our dynamics converges to a stationary point where at least one particle reaches the global minimum, with other particles exhibiting exploratory behavior. Furthermore, we show that our method facilitates faster transitions between local minima by reducing effective potential barriers with respect to Simulated Annealing. More specifically, we estimate the hitting times of unexplored potential wells for our model in the small noise regime and show that they compare favorably with the ones of overdamped Langevin. Numerical experiments on benchmark functions, including double wells and the Ackley function, validate our theoretical findings and demonstrate better performance over the well-known Simulated Annealing method in terms of escaping local minima and achieving faster convergence.
♻ ☆ Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
comment: 23 pages, 8 figures. Preprint
♻ ☆ Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
Watermarking enables GenAI providers to verify whether content was generated by their models. A watermark is a hidden signal in the content, whose presence can be detected using a secret watermark key. A core security threat are forgery attacks, where adversaries insert the provider's watermark into content \emph{not} produced by the provider, potentially damaging their reputation and undermining trust. Existing defenses resist forgery by embedding many watermarks with multiple keys into the same content, which can degrade model utility. However, forgery remains a threat when attackers can collect sufficiently many watermarked samples. We propose a defense that is provably forgery-resistant \emph{independent} of the number of watermarked content collected by the attacker, provided they cannot easily distinguish watermarks from different keys. Our scheme does not further degrade model utility. We randomize the watermark key selection for each query and accept content as genuine only if a watermark is detected by \emph{exactly} one key. We focus on the image and text modalities, but our defense is modality-agnostic, since it treats the underlying watermarking method as a black-box. Our method provably bounds the attacker's success rate and we empirically observe a reduction from near-perfect success rates to only $2\%$ at negligible computational overhead.
♻ ☆ FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making, capturing valuable historical information that can be leveraged for profitable investment strategies. Not surprisingly, this area has attracted considerable attention from researchers, who have proposed a wide range of methods based on various backbones. However, the evaluation of the area often exhibits three systemic limitations: 1. Failure to account for the full spectrum of stock movement patterns observed in dynamic financial markets. (Diversity Gap), 2. The absence of unified assessment protocols undermines the validity of cross-study performance comparisons. (Standardization Deficit), and 3. Neglect of critical market structure factors, resulting in inflated performance metrics that lack practical applicability. (Real-World Mismatch). Addressing these limitations, we propose FinTSB, a comprehensive and practical benchmark for financial time series forecasting (FinTSF). To increase the variety, we categorize movement patterns into four specific parts, tokenize and pre-process the data, and assess the data quality based on some sequence characteristics. To eliminate biases due to different evaluation settings, we standardize the metrics across three dimensions and build a user-friendly, lightweight pipeline incorporating methods from various backbones. To accurately simulate real-world trading scenarios and facilitate practical implementation, we extensively model various regulatory constraints, including transaction fees, among others. Finally, we conduct extensive experiments on FinTSB, highlighting key insights to guide model selection under varying market conditions. Overall, FinTSB provides researchers with a novel and comprehensive platform for improving and evaluating FinTSF methods. The code is available at https://github.com/TongjiFinLab/FinTSB.
♻ ☆ EchoAlign: Bridging Generative and Discriminative Learning under Noisy Labels
Noisy labels severely hinder the accuracy and generalization of machine learning models, especially when ambiguous instance features make reliable annotation difficult. Existing approaches, including transition-matrix-based label correction, struggle to capture complex relationships between instances and noisy labels, limiting their effectiveness in such settings. We present EchoAlign, a framework that bridges generative and discriminative learning under noisy labels. Instead of correcting labels, EchoAlign treats noisy labels as supervision targets and modifies the corresponding instances to align with them. The framework has two components: EchoMod uses controllable generative models to adjust instance features while preserving key instance-level structural cues, such as shape and edges, and avoiding excessive distortion; EchoSelect mitigates distribution shifts by retaining a reliable subset of original instances, guided by feature similarity between original and modified samples. This generative-discriminative interplay enables robust learning in highly noisy settings. Experiments on three benchmark datasets show that EchoAlign outperforms state-of-the-art methods in most evaluated settings. Under 30% instance-dependent noise, EchoSelect retains nearly twice as many correctly labeled samples as competing approaches while maintaining 99% selection accuracy, demonstrating the robustness and effectiveness of EchoAlign.
comment: 27 pages, 7 figures. The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-026-51604-z
♻ ☆ Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts ICLR 2026
The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the \textbf{\textit{Straggler Effect}}, as the most burdened experts dictate the overall inference latency. To address this, we first propose \textit{\textbf{Capacity-Aware Token Drop}}, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., $30\%$ speedup with only $0.9\%$ degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce \textit{\textbf{Capacity-Aware Expanded Drop}}, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8$\times$7B-Instruct yields a {0.2\%} average performance improvement and a {1.85$\times$} inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
comment: ICLR 2026
♻ ☆ SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization
Multi-objective optimization aims to solve problems with competing objectives. Evaluating such problems is often slow or expensive, limiting the budget of evaluations. In many applications, historical data from related optimization tasks is available and can be leveraged via meta-learning to accelerate optimization. Bayesian optimization, as a promising technique for expensive black-box problems, has been extended independently to meta-learning and multi-objective optimization, but methods that simultaneously address both settings remain largely unexplored. We propose SMOG-a scalable and modular meta-learning model based on a multi-output Gaussian process-that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form prior for the target task. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions. We demonstrate that our method is consistently competitive, delivering strong data efficiency across representative benchmarks and applications.
comment: 29 pages, 18 figures
♻ ☆ Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition
Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.
♻ ☆ Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence
We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
♻ ☆ Mixture of Experts for Recognizing Depression from Interview and Reading Tasks ICASSP 2026
Depression is a mental disorder and can cause a variety of symptoms, including psychological, physical, and social. Speech has been proved an objective marker for the early recognition of depression. For this reason, many studies have been developed aiming to recognize depression through speech. However, existing methods rely on the usage of only the spontaneous speech neglecting information obtained via read speech, use transcripts which are often difficult to obtain (manual) or come with high word-error rates (automatic), and do not focus on input-conditional computation methods. To resolve these limitations, this is the first study in depression recognition task obtaining representations of both spontaneous and read speech, utilizing multimodal fusion methods, and employing Mixture of Experts (MoE) models in a single deep neural network. Specifically, we use audio files corresponding to both interview and reading tasks and convert each audio file into log-Mel spectrogram, delta, and delta-delta. Next, the image representations of the two tasks pass through shared AlexNet models. The outputs of the AlexNet models are given as input to a multimodal fusion method. The resulting vector is passed through a MoE module. In this study, we employ three variants of MoE, namely sparsely-gated MoE and multilinear MoE based on factorization. Findings suggest that our proposed approach yields an Accuracy and F1-score of 87.00% and 86.66% respectively on the Androids corpus.
comment: Accepted at ICASSP 2026
♻ ☆ Incorporating rank-free coupling and external field via an incoherent modulated spatial photonic Ising machine
Spatial photonic Ising machines offer a novel optical platform for optimization and spin-model simulation, but existing diffraction-based schemes rely on auxiliary spins or multiplexing to encode high-rank couplings and external fields, reducing either speed or spin count. We demonstrate an amplitude-only, rank-free spatial photonic Ising machine in which arbitrary Ising Hamiltonians are encoded as Hadamard products on aligned amplitude and binary spatial modulators and read out by a single-pixel intensity measurement. The machine directly programs fully connected 797-spin Ising models with external fields at nearly 9-bit precision and operates at a constant iteration rate of ~200 Hz. By removing zero-valued product terms, the same architecture scales to sparse problems and experimentally solves a Max-Cut instance on a 424,108-vertex Mobius ladder graph. We also observe the phase transition of the Sherrington-Kirkpatrick model, demonstrating programmable optical simulation beyond low-rank couplings. These results establish amplitude modulation as a scalable route to programmable photonic Ising machines.
comment: 15 pages, 6 figures
Multimedia 7
☆ Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination ACL 2026
Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations-generating text that contradicts visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.
comment: Accepted by ACL 2026 Main
☆ RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce \textbf{RW-Post}, a post-aligned \textbf{text--image benchmark} for real-world multimodal fact-checking with \emph{auditable} annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide \textbf{AgentFact} as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness. Code and dataset will be released at https://github.com/xudanni0927/AgentFact.
☆ FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires matching short user queries to long-form content using both visual and auditory evidence. Yet existing retrieval benchmarks are still dominated by short clips, single modalities, and caption-based evaluation. We introduce FLARE, a full-modality long-video audiovisual retrieval benchmark with user-simulated queries. Built from 399 carefully screened Video-MME videos (10--60 min, 225.4 h) to ensure source quality and diversity, FLARE contains 87,697 clips annotated with vision, audio, and unified audiovisual captions, together with 274,933 user-style queries. Cross-modal queries are further filtered by a hard bimodal constraint, requiring retrieval to fail under either modality alone but succeed when both are combined. FLARE evaluates models under two regimes, caption-based and query-based retrieval, across vision, audio, and unified audiovisual settings. Experiments with 15 representative retrievers show that user-style queries substantially change model behavior, strong caption-based performance does not always transfer to query-based retrieval, and audio--language alignment remains a key bottleneck for unified audiovisual retrieval. Our code and data are released at https://flarebench.github.io/
☆ Tube-Structured Incremental Semantic HARQ for Generative Video Receivers
Generative semantic communication uses receiver-side generative priors to reconstruct visual content from compact semantics, making it attractive for bandwidth-limited multimedia delivery. For video, reliable recovery remains difficult because errors accumulate over time, useful evidence is temporally correlated, and the receiver must make decisions under limited interaction, retransmission, and reconstruction budgets. Existing generative semantic communication studies mainly emphasize representation, compression, or generative reconstruction, while recent error-resilient and semantic-HARQ methods still largely operate on encoder-defined or frame-block retransmission units. This paper studies receiver-driven semantic HARQ for generative video reconstruction under a budget-constrained AoIS-AUC objective and argues that the retransmission primitive is itself an important system design variable. We propose tube-structured package-native requests, in which temporally local packages are the channel-visible HARQ objects and are transmitted, dropped, received, and committed at package granularity. Under a controlled comparison protocol with matched backbone, budgets, and channel model, this primitive yields lower time-weighted recovery cost than competitive block-based baselines in practically relevant moderate-to-harsh regimes, while the gap naturally shrinks in near-clean channels. The gain mainly appears as earlier stabilization of the recovery trajectory, while final-quality endpoints remain broadly comparable, and it persists even against a tube-aware block-ranking baseline.
☆ HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model via pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within an Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version HiDream-O1-Image-Pro (200B+) unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.
comment: Source codes and models are available at Github: https://github.com/HiDream-ai/HiDream-O1-Image and Huggingface: https://huggingface.co/HiDream-ai/HiDream-O1-Image
♻ ☆ H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping
Existing H.265/HEVC video steganalysis research mainly focuses on detecting the steganography based on motion vectors, intra prediction modes, and transform coefficients. However, there is currently no effective steganalysis method capable of detecting steganography based on Coding Unit (CU) block structure. To address this issue, we propose, for the first time, a H.265/HEVC video steganalysis algorithm based on CU block structure gradients and intra prediction mode mapping. The proposed method first constructs a new gradient map to explicitly describe changes in CU block structure, and combines it with a block level mapping representation of IPM. It can jointly model the structural perturbations introduced by steganography based on CU block structure. Then, we design a novel steganalysis network called GradIPMFormer, whose core innovation is an integrated architecture that combines convolutional local embedding with Transformer-based token modeling to jointly capture local CU boundary perturbations and long-range cross-CU structural dependencies, thereby effectively enhancing the capability to perceive CU block structure embedding. Experimental results show that under different quantization parameters and resolution settings, the proposed method consistently achieves superior detection performance across multiple steganography methods based on CU block structure. This study provides a new CU block structure steganalysis paradigm for H.265/HEVC and has significant research value for covert communication security detection.
♻ ☆ Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration ICML 2026
Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.
comment: This study has been Accepted by ICML 2026. The current version is a manuscript, please refer to the official version released at ICML 2026 for the final published version
Computer Vision and Pattern Recognition 69
☆ CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection CVPR 2026
Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) for enhanced cross-view detection for VLM. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.
comment: Accepted to CVPR 2026. Code available at https://github.com/1nyourlife/Crossvl_cvpr2026
☆ DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving
DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.
☆ Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos
This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.
☆ On-Policy Distillation with Best-of-N Teacher Rollout Selection
On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
comment: 10 pages, 5 figures
☆ Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.
☆ MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding CVPR 2026
Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully-designed benchmark with a real-world dataset MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on our MOTOR-Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR-MAS outperforms the best single-model benchmark by 15.93 points in Macro-F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent benchmark by 10.2 points in internal cognition prediction.
comment: Accepted by CVPR 2026 workshop AI4RWC
☆ DriveFuture: Future-Aware Latent World Models for Autonomous Driving
Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching \textbf{55.5} EPDMS on NAVSIM-v2 {\textcolor{blue}{\textit{navhard}}}, \textbf{89.9} EPDMS on NAVSIM-v2 {\textcolor{blue}{\textit{navtest}}}, and \textbf{90.7} PDMS on NAVSIM-v1 {\textcolor{blue}{\textit{navtest}}}, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks \textbf{1st} on the \href{https://huggingface.co/spaces/AGC2025/e2e-driving-navhard}{NAVSIM-v2 {\textcolor{blue}{\textit{navhard}}}} leaderboard and achieves SOTA performance on \href{https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest}{NAVSIM-v1 {\textcolor{blue}{\textit{navtest}}}}.
comment: 24pages, 7 figures
☆ A Real-Calibrated Synthetic-First Data Engine
Modern computer vision systems increasingly encounter performance limitations in data-scarce domains, where collecting large-scale, high-quality labeled data is costly or impractical. While controllable diffusion models enable scalable synthetic image generation, directly applying synthetic augmentation often leads to unstable performance gains due to dataset-level quality issues and insufficient feedback mechanisms. In this work, we present a Real-Calibrated Synthetic-First Data Engine, a modular data engineering framework that combines controllable diffusion generation and multi-stage curation/filtering within a unified pipeline, with optional support for uncertainty-driven selection and human verification. Instead of introducing new generative algorithms, our approach focuses on systematic dataset construction for improving the practical reliability of synthetic augmentation in low-data regimes. The framework is implemented as a modular CLI-based pipeline, where generation, filtering, selection, and validation components can be independently configured and replaced. This design emphasizes reproducibility, flexibility, and practical deployment in real-world data workflows. Through empirical evaluation centered on human pose estimation, we show that synthetic data improves a real-data baseline when used as near-zero-human-annotation-cost augmentation alongside real anchors, while synthetic-only training remains substantially below real-only performance. Supplementary segmentation diagnostics show the same domain-gap pattern. These results highlight the practical value of data-centric orchestration for low-data augmentation.
comment: 7 pages, 6 figures
☆ Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction
In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
comment: 15 pages, 17 tables
☆ Do multimodal models imagine electric sheep?
Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
☆ ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes
Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.
comment: 28 pages, 12 figures
☆ Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution
Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling throughshifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block's standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENμS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
comment: 15 pages
☆ Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models
Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.
comment: 10 pages
☆ DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
☆ VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement
Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
☆ Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models
Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
☆ S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes
We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this Paper, we also made a comparison to other neural network architectures (CNN`s). Have a look at the results and feel free to contact me for any questions. This is my first paper:) Made by Hackbert
comment: 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request
☆ Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models IJCNN 2026
Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.
comment: 8 pages, 5 figures, Accepted to IJCNN 2026
☆ BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction CVPR 2026
Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, ObjectGS, acknowledge this limitation and propose approaches that modify the scene's geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near perfect boundaries in object extraction. We do so by introducing two new losses in the optimization that take care of: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene's geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state of the art methods across 4 datasets, using six metrics, demonstrate that our approach produces overall the best boundary segmentation to date.
comment: CVPR 2026 Highlight
☆ Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
Visual Geometry Grounded Transformer (VGGT) advances 3D reconstruction via scalable Transformer architecture, but the quadratic complexity of global attention prevents long context application. StreamVGGT enables streaming with causal attention, yet its KV cache grows linearly with frames, causing memory overflow and quality degradation. We present RetrieveVGGT, a training-free framework, which formulates context construction for VGGT as a retrieval problem. By retrieving a fixed number of relevant frames at each step, VGGT maintains a controllable memory budget, which is close to its training context length. Interestingly, we find that the similarity between current frame queries and cached history frame keys at the first global attention layer of VGGT is already a strong indicator of relevance, eliminating the need for additional learned scoring. To enhance information diversity similar to a recommender system, we propose Segment Sampling so that the retrieval spans distinct relevant segments rather than a single high-similarity region. We design a pose-aware spatial memory mechanism that organizes history frames according to their already estimated camera poses, enabling location-aware retrieval. Extensive experiments demonstrate that RetrieveVGGT achieves state-of-the-art performance, outperforming StreamVGGT, TTT3R, and InfiniteVGGT while maintaining constant memory usage regardless of sequence length. Code is available at https://github.com/zzctmd/RetrieveVGGT.
☆ Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
☆ XTinyU-Net: Training-Free U-Net Scaling via Initialization-Time Sensitivity MICCAI 2026
While U-Net architectures remain the gold standard for medical image segmentation, their deployment in resource-constrained environments demands aggressive model compression. However, finding an optimally efficient configuration is computationally prohibitive, typically requiring exhaustive train-and-evaluate cycles to find the smallest model that maintains peak performance. In this paper, we introduce a training-free selection framework to automatically identify ultralightweight, dataset-specific U-Net configurations directly at initialization. We observe that systematically scaling down U-Net channel width induces a sharp transition from a stable performance plateau to representational capacity collapse. To pinpoint this boundary without training, we propose a Jacobian-based sensitivity metric that scores discrete, width-capped U-Net variants using a small set of unlabeled images. By analyzing the total variation of this sensitivity curve, we isolate the smallest stable configuration, which we denote as XTinyU-Net. Evaluated across six diverse medical datasets within the nnU-Net framework, XTinyU-Net achieves segmentation accuracy comparable to the heavy nnU-Net baseline with 400x-1600x fewer parameters, and outperforms contemporary lightweight architectures while utilizing 5x-72x fewer parameters. Code is publicly accessible on https://github.com/alvinkimbowa/nntinyunet.git.
comment: Early accepted to MICCAI 2026
☆ DegBins: Degradation-Driven Binning for Depth Super-Resolution
Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
comment: 9 pages
☆ Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study CVPR 2026
Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with winner of GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.
comment: Accepted by CVPR 2026 main conference. Compare to CVPR version, minor updates here are included (e.g., combine main text and appendix; clarify the timing scenario in appendix)
☆ GSMap: 2D Gaussians for Online HD Mapping
Accurate High-Definition (HD) map construction is critical for autonomous driving, yet existing methods face a fundamental trade-off: vectorization-based approaches preserve topology but struggle with geometric fidelity, while rasterization-based approaches enable precise geometric supervision but produce unstructured outputs. To bridge this gap, we propose GSMap, a novel framework that unifies both paradigms via a learnable 2D Gaussian representation. Each map element is modeled as an ordered sequence of 2D Gaussians, whose centers correspond to the vertices of the vectorized polyline/polygon. This formulation enables simultaneous optimization through: (1) Differentiable rasterization that enforces pixel-level geometric constraints, and (2) Topology-aware vectorization that maintains structural regularity. Experiments on both nuScenes and Argoverse2 demonstrate that our Gaussian-based representation effectively unifies geometric and topological learning, achieving significant performance improvements and demonstrating strong compatibility with existing HD mapping architectures. Code will be available at https://github.com/peakpang/GSMap
comment: Preprint
☆ Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
Long chain-of-thought (CoT) reasoning improves large vision--language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.
comment: Under Review
☆ SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber
☆ On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. Such a powerful reconstruction capability that enables creative design can also be misused by the adversary to generate harmful geometries, which can be further fabricated via 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three kinds of unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality that current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families, including input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at 11% overall false-positive cost. Taken together, our findings demonstrate that the risk in current system and encourage better geometry-aware safeguards for moderation.
☆ DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition
Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.
☆ Uncertainty-Guided Dual-Domain Learning for Reliable Skin Lesion Segmentation
Accurate skin lesion segmentation is vital for dermoscopic Computer-Aided Diagnosis. However, visual ambiguity and morphological irregularity often defeat spatial modeling, necessitating multi-domain architectures. Existing paradigms frequently overlook the active use of prediction uncertainty, leading to deterministic frameworks that suffer from blind cross-domain fusion and overfit to label noise. To address these issues, we propose the Uncertainty-Guided Dual-Domain Network (UGDD-Net). UGDD-Net introduces a novel "Glance-and-Gaze" mechanism to transform uncertainty into an active guiding signal. Specifically, the Uncertainty-Guided Bi-directional Feature Fusion (UGBFF) module uses pixel-level uncertainty to modulate spatial-spectral interactions. The Uncertainty-Guided Graph Refinement (UGGR) module constructs a topology-aware graph to propagate reliable semantic consensus and refine uncertain nodes. Finally, the Uncertainty-Guided Margin-Adaptive Loss (UGML) enforces strict constraints on confident pixels while relaxing penalties on uncertain ones to improve statistical calibration. Extensive experiments on ISIC2017, ISIC2018, PH2, and HAM10000 datasets demonstrate that UGDD-Net achieves state-of-the-art performance, especially on "Hard Samples". Our uncertainty maps align with expert inter-observer variability, providing robust interpretability for human-machine collaborative diagnosis.
comment: 34 pages, 11figures
☆ SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy NeurIPS 2026
Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear on whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning $13$ common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed $50\%$ grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
comment: Under Review in NeurIPS 2026
☆ From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: \textbf{C}ounterfactual \textbf{A}ttribute \textbf{F}actuality \textbf{E}valuation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our \textbf{CAFE} is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (\textbf{SM}), Context Conflict (\textbf{CC}), and Ontological Conflict (\textbf{OC}). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
comment: 30 pages, 8 figures
☆ DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos
World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics--neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as distributed compliant actuator for hand--continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis.
☆ FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision
This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
comment: Accepted for ARC 2026
☆ Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI
Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists' sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.
☆ KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body--hand--face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov--Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
comment: Accepted at Neurocomputing
☆ Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing
Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into double subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules are designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, facilitating CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.
☆ Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs
Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
comment: 2 pages, 1 figure, 2 tables
☆ PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions ICML 2026
While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects-limiting their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations-or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.
comment: Accepted to ICML 2026
☆ QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking
Tracking points in videos is typically formulated as frame-to-frame correspondence, where each point is matched locally to the next frame. While this works over short horizons, errors accumulate under articulation, occlusion, and viewpoint change, leading to silent semantic drift that existing trackers cannot detect or correct. In this work, we revisit long-horizon tracking from a monitoring perspective and introduce QueST, a monitoring-by-design framework that treats interaction-relevant entities as persistent semantic queries rather than transient point tracks. Instead of local propagation, each query attends globally over spatio-temporal video features at every time-step, providing a stable semantic anchor across time. We further constrain query trajectories with lightweight 3D physical grounding, using geometric plausibility to suppress unbounded drift under occlusion. We evaluate QueST on long-horizon articulated sequences from PartNet-Mobility in SAPIEN and compare against RAFT-3D, CoTracker, and TAP-Net. QueST substantially reduces terminal drift achieving a 67.7% Absolute Point Error (APE) improvement over TAP-Net while better preserving identity over extended horizons. Our results show that embedding semantic monitoring directly into perception enables more reliable long-horizon tracking under distribution shift.
☆ Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization IJCNN 2026
Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
comment: Accepted for presentation at the 2026 International Joint Conference on Neural Networks (IJCNN 2026)
☆ PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models
Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8$\times$ single step speedup and reduces the DiT memory footprint by 3.5$\times$ under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
☆ ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate--task trade-offs across multiple downstream tasks.
☆ Outlier-Robust Diffusion Solvers for Inverse Problems CVPR 2026
Methods based on diffusion models (DMs) for solving inverse problems (IPs) have recently achieved remarkable performance. However, DM-based methods typically struggle against outliers, which are common in real-world measurements. In this work, to tackle IPs with outliers, we first refine the measurement via explicit noise estimation to mitigate the effect of noise. Subsequently, we formulate an iteratively reweighted least squares objective based on the Huber loss to address the outliers. We propose a method utilizing gradient descent to approximately solve the corresponding optimization problem for the robust objective. To avoid delicate tuning of the learning rate required by the gradient descent method, we further employ the conjugate gradient method with an efficient strategy for updating. Extensive experiments on multiple image datasets for linear and nonlinear tasks under various conditions demonstrate that our proposed methods exhibit robustness to outliers and outperform recent DM-based methods in most cases.
comment: Accepted by CVPR 2026
☆ When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation
Identity-preserved image generation is typically built on many-step diffusion backbones, making personalized generation expensive at deployment time. We show that this cost is often unnecessary for identity-conditioned FLUX generation. A frozen InfuseNet identity adapter trained with dev transfers directly to the distilled schnell backbone without retraining. This two-line replacement -- changing the backbone path and disabling classifier-free guidance -- reduces latency by 5.9x while improving ArcFace identity similarity by +0.028 and lpips by -0.016 over the standard 28-step dev baseline. To explain why this works, we analyze the denoising trajectory and find that identity fidelity enters an early effective regime, often within 4-8 steps, while later steps primarily refine visual detail, sharpness, and contrast. Adapter ablations confirm that identity formation depends on the identity adapter, while attention-stream norm probes suggest that the relative conditioning contribution decreases as sampling proceeds. Preliminary style-adapter and object-adapter sweeps on SDXL and SD1.5 show similar diminishing returns after intermediate steps. These results position distilled backbone replacement as a simple, training-free strategy for improving the efficiency-fidelity tradeoff of identity-preserved generation.
☆ Adaptive 3D Convolution for Remote Sensing Image Fusion IEEE
Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.
comment: Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026
☆ SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.
comment: 14 pages, 3 figures
☆ Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
☆ SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.
comment: Code is available at https://github.com/ShanwenTan/SWIFT
☆ Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs ICML 2026
Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
comment: Accepted by ICML 2026
☆ FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation
Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies on strictly next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before jointly fine-tuned with backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through a lightweight post-training with merely 0.05% of the original training data.
comment: Post-training acceleration for autoregressive image generation
☆ Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning
♻ ☆ Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.
♻ ☆ AI- Enhanced Stethoscope in Remote Diagnostics for Cardiopulmonary Diseases
The increase in cardiac and pulmonary diseases presents an alarming and pervasive health challenge on a global scale responsible for unexpected and premature mortalities. In spite of how serious these conditions are, existing methods of detection and treatment encounter challenges, particularly in achieving timely diagnosis for effective medical intervention. Manual screening processes commonly used for primary detection of cardiac and respiratory problems face inherent limitations, increased by a scarcity of skilled medical practitioners in remote or under-resourced areas. To address this, our study introduces an innovative yet efficient model which integrates AI for diagnosing lung and heart conditions concurrently using the auscultation sounds. Unlike the already high-priced digital stethoscope, our proposed model has been particularly designed to deploy on low-cost embedded devices and thus ensure applicability in under-developed regions that actually face an issue of accessing medical care. Our proposed model incorporates MFCC feature extraction and engineering techniques to ensure that the signal is well analyzed for accurate diagnostics through the hybrid model combining Gated Recurrent Unit with CNN in processing audio signals recorded from the low-cost stethoscope. Beyond its diagnostic capabilities, the model generates digital audio records that facilitate in classifying six pulmonary and five cardiovascular diseases. Hence, the integration of a cost effective stethoscope with an efficient AI empowered model deployed on a web app providing real-time analysis, represents a transformative step towards standardized healthcare
♻ ☆ MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices
Objective: To develop a robust and compact deep learning model for automated knee cartilage segmentation on point-of-care ultrasound (POCUS) devices. Methods: We propose MonoUNet, a novel, highly compact segmentation model consisting of (i) an aggressively reduced U-Net backbone, (ii) a trainable monogenic block that extracts multi-scale local phase features from the input, and (iii) a gating mechanism that injects these features into the encoder stages to reduce sensitivity to variations in ultrasound image appearance. MonoUNet segmentation performance was evaluated on a multi-site, multi-device knee cartilage ultrasound dataset using Dice score and mean average surface distance (MASD). Agreement between MonoUNet and manual cartilage outcomes (thickness and echo intensity) was assessed using Bland-Altman analysis with 95% limits of agreement, and reliability was assessed using intraclass correlation coefficient (ICC$_{2,k}$). Results: Overall, MonoUNet outperformed existing lightweight segmentation models, with average Dice scores ranging from 92.62% to 94.82% and MASD values between 0.133 mm and 0.254 mm. MonoUNet reduces the number of parameters by 10x--700x and computational cost by 14x--2000x relative to existing lightweight models. MonoUNet cartilage outcomes showed excellent reliability and agreement with the manual outcomes: intraclass correlation coefficients (ICC$_{2,k})$=0.96 and bias=2.00% (0.047 mm) for average thickness, and ICC$_{2,k}$=0.99 and bias=0.80% (0.328 a.u.) for echo intensity. Conclusion: Incorporating trainable local phase features improves the robustness of highly compact neural networks for knee cartilage segmentation across varying acquisition settings and could support scalable ultrasound-based assessment and monitoring of knee osteoarthritis using POCUS devices. The code is publicly available at https://github.com/alvinkimbowa/monounet.
comment: 17 pages, 4 figures. Published in Ultrasound in Medicine & Biology (2026)
♻ ☆ Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.
comment: 24 pages, model evaluation report
♻ ☆ SegviGen: Repurposing 3D Generative Model for Part Segmentation
We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
comment: Project page: https://fenghora.github.io/SegviGen-Page/
♻ ☆ Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation SIGGRAPH 2026
Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the images global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
comment: SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/
♻ ☆ SS3D: End2End Self-Supervised 3D from Web Videos
We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
♻ ☆ SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? ICLR 2026
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
comment: Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables
♻ ☆ Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including the open-source models and closed-source models in geometry details, condition alignment, texture quality, and etc. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
comment: GitHub link: https://github.com/Tencent/Hunyuan3D-2
♻ ☆ CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning reduces inference cost, but most methods rely on retraining or multi-stage optimization, which limits post-training deployment. We propose CORP, a closed-form one-shot structured pruning method that removes MLP dimensions and attention substructures using only unlabeled calibration data without gradients or fine-tuning. CORP formulates structured pruning as a representation recovery problem. It models removed components as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes a layer-local affine/logit reconstruction objective under the calibration distribution. Experiments on ImageNet with DeiT reveal strong redundancy in both MLP and attention representations. With CORP, models retain high accuracy under aggressive sparsity. On DeiT-Huge, CORP achieves 83.27% Top-1 accuracy after pruning 50\% of both MLP and attention structures.
♻ ☆ HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
comment: Project Page: https://tianshuoy.github.io/HiVLA-page/
♻ ☆ STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
♻ ☆ Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards CVPR 2026
Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
comment: 22 pages, accepted to CVPR 2026. Project page https://wookiekim.github.io/SOLACE/
♻ ☆ AUHead: Realistic Emotional Talking Head Generation via Action Units Control ICLR
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e. , Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR
comment: https://openreview.net/forum?id=dmzlAUkulz&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DICLR.cc%2F2026%2FConference%2FAuthors%23your-submissions) Accepted at the 14th International Conference on Learning Representations (ICLR 2026)
♻ ☆ Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models
Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, full fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that our Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.
♻ ☆ When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
♻ ☆ AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
Practical autonomous driving requires models that generalize by reasoning through spatial-temporal possibilities to exclude unsafe outcomes. While state-of-the-art (SOTA) methods use parallel planning architectures, they fail to explicitly couple speed decisions with agent behavior along the driving path, leading to suboptimal coordination. To address this, we propose a cascaded framework that transforms longitudinal planning from an independent prediction task into a path-conditioned reasoning process. On the model side, we introduce an anchor-based regression design that conditions longitudinal prediction on the lateral drive path, and reformulate longitudinal planning as 1D displacement prediction along the path. This reduces geometric uncertainty and sharpens the model's focus on interaction-driven dynamics. On the data side, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events by programmatically inserting agents and relabeling longitudinal targets to enforce collision avoidance. Evaluated on the challenging Bench2Drive benchmark, our method achieves SOTA performance with a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety. Further evaluation on Fail2Drive confirms strong generalization to rare edge cases where parallel formulations typically fail. Project page:https://yanhaowu.github.io/AlignDrive/.
comment: underreview
Computation and Language 104
☆ Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human--AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
☆ cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu
This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.
comment: Accepted to Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026)
☆ Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution GECCO 2026
Large Language Models exhibit mode collapse, producing homogeneous outputs that fail to explore valid solution spaces. We present QD-LLM, a framework for parameter-efficient neuroevolution that evolves prompt embeddings, compact neural interfaces (~32K parameters) that steer generation in frozen LLMs (70B+ parameters), within a Quality-Diversity (QD) optimization framework. Our contributions: (1) evolved prompt embeddings via gradient-free optimization enabling behavioral steering without model fine-tuning; (2) hybrid behavior characterization combining semantic and explicit features with formal coverage bounds (Theorem 1) under validated near-independence (NMI $= 0.08 \pm 0.02$); (3) co-evolutionary variation operators including targeted behavioral mutation via finite-difference gradient estimation. On HumanEval (164 problems), MBPP, and creative writing benchmarks, QD-LLM achieves 46.4% higher coverage and 41.4% higher QD-Score than QDAIF ($p<0.001$, 30 runs, Vargha-Delaney $A=0.94$). We demonstrate downstream utility: diverse archives improve test generation (34% more edge cases) and fine-tuning data quality (8.3% accuracy gain). We validate across open-source LLMs (Llama-3-70B, Mistral-Large) with full embedding access, establishing prompt embedding evolution as an effective paradigm bridging neuroevolution and modern LLMs.
comment: 11 pages, 3 figures, 7 tables, 1 algorithm, 1 theorem. Accepted to GECCO 2026
☆ Nectar: Neural Estimation of Cached-Token Attention via Regression
Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|θ|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
☆ EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent GECCO 2026
Gradient-based preference optimization methods for large language model (LLM) alignment suffer from preference collapse, converging to narrow behavioral modes while neglecting preference diversity. We introduce EvoPref, a multi-objective evolutionary algorithm that maintains populations of Low-Rank Adaptation (LoRA) adapters optimized across helpfulness, harmlessness, and honesty objectives using Non-dominated Sorting Genetic Algorithm II (NSGA-II) selection with archive-based diversity preservation. Our primary contribution is demonstrating that population-based methods discover substantially more diverse alignments than gradient descent. On standard benchmarks, EvoPref improves preference coverage by 18% (median 82.5% vs. 70.0% for ORPO, $p<0.001$, Wilcoxon, $n=30$) and reduces collapse rates by 47% (11.0% vs. 20.6%, $p<0.001$), while achieving competitive alignment quality (median 75.5% RewardBench vs. 75.0% for ORPO, $p<0.05$). We provide theoretical motivation extending recent multi-objective evolutionary algorithm (MOEA) runtime analysis (Dang et al., 2025) suggesting why archive-based methods escape collapse more effectively than single-trajectory optimization. Comprehensive comparisons against MOEA/D, SMS-EMOA, CMA-ES, and gradient baselines (DPO, IPO, KTO, ORPO) with rigorous statistical testing (Friedman with Holm correction, Vargha-Delaney effect sizes, median with IQR) confirm that multi-objective selection with diversity preservation is essential. This work establishes evolutionary optimization as a principled paradigm for diverse LLM alignment.
comment: 10 pages, 2 figures, 6 tables, 1 algorithm. Accepted to GECCO 2026
☆ Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that feature discovery method itself modulates intervention depth: contrastively-discovered features change both self-report and behavior, while semantically-searched features change only self-report (d=12.65 between methods on behavior). These findings suggest that antisocial tendencies in at least one large language model comprise dissociable components rather than a unified construct, with implications for how such tendencies should be detected, measured, and controlled.
comment: 12 pages, 3 figures
☆ ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking
A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.
☆ Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes
Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2 V\rceil$ bits. We replace the usual trainable $V\times d_{\text{model}}$ input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, $V=65{,}536$, so $K=16$, and tokens are represented by fixed 16-dimensional binary codes tiled to $d_{\text{model}}=1024$. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over $\mathbb{F}_2^K$. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8\%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
☆ The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods ACL 2026
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
comment: Accepted at GEM Workshop @ ACL 2026
☆ Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-$k$ judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of $0.006$ versus $0.013$ under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline subset search. We explain this reversal with oracle analyses showing that the optimal calibrated risk under proper scoring rules cannot increase when additional judge signals are made available, and that even below-chance judges can be useful when their biases are learnable and their signals are non-redundant. The resulting operating principle is simple: in multi-judge evaluation with labeled calibration data, do not discard weak judges by accuracy alone; keep them when they are parseable, non-redundant, and calibratable.
☆ Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies IJCAI 2026
Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30\% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
comment: This work has been accepted at IJCAI 2026
☆ MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies
Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018--2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by a well-structured analysis showing our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by high Pearson's r correlation (0.81) and Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
☆ K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
☆ Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness
LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
☆ Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.
comment: 23 pages, 15 figures
☆ Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols
When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
comment: 14 pages, 5 figures. Technical report / preprint
☆ Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.
comment: Preprint. Implementation and open-source community version available at: https://github.com/corbenic/merlin-community - https://zenodo.org/records/20090712
☆ Edit-Based Refinement for Parallel Masked Diffusion Language Models ICML 2026
Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
comment: Accepted to ICML 2026
☆ CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL rewards signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair, and the first adaptive rubric for clinical reasoning which is verifiable in the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%) and has comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.
☆ Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs
Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
comment: 2 pages, 1 figure, 2 tables
☆ Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Especially low-resource languages exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
comment: preprint
☆ TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or capability during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs agents' birth-death operations on MAS, including edge edit, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The codes are released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.
☆ TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
Diffusion large language models (dLLMs) offer a promising paradigm for parallel text generation, but in practice they face an accuracy-parallelism trade-off, where increasing tokens per forward (TPF) often degrades generation quality. Existing acceleration methods often gain speed at the cost of accuracy. To address this limitation, we propose TAD, a Temporal-Aware trajectory self-Distillation framework. During data construction, we condition a teacher model on both the prompt and the ground-truth response to generate decoding trajectories, recording the intermediate masked states throughout the process. Based on how many decoding steps remain before each masked token is revealed, we partition masked positions into near and distant subsets. For near tokens, we train the student with a hard cross-entropy loss using the teacher trajectory tokens as labels, encouraging confident predictions for tokens that are about to be decoded. For distant tokens, we apply a soft KL divergence loss between the teacher and student token distributions, providing softer supervision and preserving future planning knowledge. This temporal-aware partition naturally gives rise to two deployment configurations: a Quality model that prioritizes accuracy and a Speed model that favors more aggressive acceleration. Experiments show that TAD consistently improves the accuracy-parallelism trade-off. On LLaDA, it raises average accuracy from 46.2\% to 51.6\% with the Quality model and average AUP from 46.2 to 257.1 with the Speed model. Our code is available at: https://github.com/BHmingyang/TAD
☆ Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications AAAI 2026
Large Language Models (LLMs) are increasingly employed in enterprise question-answering (QA) systems, requiring adaptation to domain-specific knowledge. Among the most prevalent methods for incorporating such knowledge are Retrieval-Augmented Generation (RAG) and fine-tuning (FT). Yet, from a cost-accuracy trade-off perspective, it remains unclear which approach best suits industry scenarios. This study examines the impact of RAG and FT on two closed datasets specific to the automotive industry, assessing answer quality and operational costs. We extend the Cost-of-Pass framework proposed by Erol et al. (arXiv:2504.13359) to jointly assess output quality, generation cost, and user interaction cost. Our findings reveal that while premium models perform best out of the box, open-source models can achieve comparable quality when enhanced with RAG. Overall, RAG emerges as the most effective and cost-efficient adaptation method for both closed- and open-source models.
comment: Accepted at AAAI 2026 Workshop on New Frontiers in Information Retrieval
☆ MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
As LLM-powered agents are increasingly deployed in edge-cloud environments, personalized memory has become a key enabler of long-term adaptation and user-centric interaction. However, cloud-assisted memory management exposes sensitive user information, while existing privacy protection methods typically rely on aggressive masking that removes task-relevant semantics and consequently degrades memory utility and personalization quality. To address this challenge, We propose MemPrivacy, which identifies privacy-sensitive spans on edge devices, replaces them with semantically structured type-aware placeholders for cloud-side memory processing, and restores the original values locally when needed. By decoupling privacy protection from semantic destruction, MemPrivacy minimizes sensitive data exposure while retaining the information required for effective memory formation and retrieval. We also construct MemPrivacy-Bench for systematic evaluation, a dataset covering 200 users and over 52k privacy instances, and introduce a four-level privacy taxonomy for configurable protection policies. Experiments show that MemPrivacy achieves strong performance in privacy information extraction, substantially surpassing strong general-purpose models such as GPT-5.2 and Gemini-3.1-Pro, while also reducing inference latency. Across multiple widely used memory systems, MemPrivacy limits utility loss to within 1.6%, outperforming baseline masking strategies. Overall, MemPrivacy offers an effective balance between privacy protection and personalized memory utility for edge-cloud agents, enabling secure, practical, and user-transparent deployment.
☆ Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal ICML 2026
Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe on hidden states predicts trace correctness with 0.95 AUROC -- from the very first reasoning step (0.79) -- while verbalized confidence for wrong traces is 4.55/5, nearly identical to correct ones (4.87/5). A text-surface classifier achieves only 0.59 on the same data, confirming a 0.20-point gap invisible in the generated text. This hidden error awareness holds across three model families (Qwen, Llama, Phi), 1.5B-72B parameters, and RL-trained reasoning models (DeepSeek-R1, 0.852 AUROC). The natural question is whether this signal can fix the errors it detects. It cannot. Four interventions -- activation steering, probe-guided best-of-N, self-correction, and activation patching -- all fail; patching destroys output coherence entirely. The signal is diagnostic, not causal: a readout of computation quality, not a lever to redirect it. This delineates a boundary for mechanistic interpretability: error representations during reasoning are fundamentally different from the factual knowledge representations that prior work has successfully edited.
comment: 10 pages, 5 figures, 10 tables.Mechanistic Interpretability @ ICML 2026
☆ Beyond Language: Format-Agnostic Reasoning Subspaces in Large Language Models
Large language models represent the same reasoning in vastly different surface forms -- English prose, Python code, mathematical notation -- yet whether they share a common internal substrate across these symbolic systems remains unknown. We introduce the TriForm Benchmark (18 concepts x 6 forms x 3 instances = 324 stimuli) and study five LLMs (1.6B-8B) across three architecture families. Using permutation-corrected RSA, cross-form probing, and activation patching, we find converging evidence for a Format-Agnostic Reasoning Subspace (FARS) in middle layers. We make FARS concrete: concept-centroid PCA extracts a 10-dimensional subspace that amplifies concept structure 3x while suppressing form information to near zero. Replacing only these 10 dimensions during cross-form patching preserves 90-96% of model output -- far exceeding both full activation replacement (44-56%) and variance-maximizing PCA (60-74%) -- while ablating them causes targeted disruption. FARS generalizes to held-out concepts and converges across architectures (CCA > 0.79 for all model pairs), providing within-modality evidence for the Platonic Representation Hypothesis. We further discover a declarative-procedural asymmetry: representations are far more compatible between prose and mathematics than between either and code, suggesting that the critical axis of divergence is not linguistic vs. formal but declarative vs. procedural.
comment: Preprint. 13 pages, 13 figures, 12 tables
☆ APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation
Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.
☆ Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning ICML 2026
Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.
comment: Preprint. 14 pages + appendix. Under review at AdaptFM Workshop @ ICML 2026
☆ A Cognitively Grounded Bayesian Framework for Misinformation Susceptibility
In this (work in progress) paper, we present Bounded Pragmatic Listener (or BPL), a cognitively grounded Bayesian framework for modelling susceptibility to information disorder. BPL extends Rational Speech Act theory with three cognitively motivated bounds derived from the bounded rationality literature with a) a recursion depth bound (that emphasises working memory limits);b) a prior compression parameter (which is oriented at capturing information bottleneck); and c) an availability sample size (that operationalises importance sampling with saliency-weighted proposals). This allows us to test predictions about misinformation susceptibility, annotator disagreement, and the differential vulnerability to mis-, dis-, and mal-information as defined in the Information Disorder framework. We validate BPL on the LIAR and MultiFC benchmarks showcasing competitive veracity classification and experimental support for the depth-mismatch paradox.
comment: work in progress
☆ Align and Shine: Building High-Quality Sentence-Aligned Corpora for Multilingual Text Simplification LREC 2026
Text simplification plays a crucial role in improving the accessibility and comprehensibility of written information for diverse audiences, including language learners and readers with limited literacy. Despite its importance, large-scale, high-quality datasets for training and evaluating text simplification models remain scarce for languages other than English. This paper reports an experimental study on the collection and processing of crowd-sourced simplification data from comparable corpora to construct a corpus suitable for both training and testing text simplification systems across multiple languages (Catalan, English, French, Italian and Spanish). We report mechanisms for sentence-level alignment from document-level data. The resulting dataset of the aligned sentence pairs is publicly available.
comment: Accepted at BUCC 2026 workshop at LREC 2026
☆ FinMoji: A Framework for Emoji-driven Sentiment Analysis in Financial Social Media
This paper explores the use of emojis in financial sentiment analysis, focusing on the social media platform StockTwits. Emojis, increasingly prevalent in digital communication, have potential as compact indicators of investor sentiment, which can be critical for predicting market trends. Our study examines whether emojis alone can serve as reliable proxies for financial sentiment and how they compare with traditional text-based analysis. We conduct a series of experiments using logistic regression and transformer models. We further analyze the performance, computational efficiency, and data requirements of emoji-based versus text-based sentiment classification. Using a balanced dataset of about 528,000 emoji-containing StockTwits posts, we find that emoji-only models achieve F1 approximately 0.75, lower than text-emoji combined models, which achieve F1 approximately 0.88, but with far lower computational cost. This is a useful feature in time-sensitive settings such as high-frequency trading. Furthermore, certain emojis and emoji pairs exhibit strong predictive power for market sentiment, demonstrating over 90 percent accuracy in predicting bullish or bearish trends. Finally, our research reveals large statistical differences in emoji usage between financial and general social media contexts, stressing the need for domain-specific sentiment analysis models.
☆ Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks. However, their deployment in long-context scenarios faces high computational overhead and information redundancy. While soft prompt compression has emerged as a promising way to mitigate these costs by compressing sequences into compact embeddings, existing paradigms remain fundamentally constrained by position bias: they primarily rely on learnable tokens insertion at fixed positions or group tokens according to their physical token layout, thereby inducing performance instability and semantic fragmentation. To overcome this bottleneck, we propose Semantic Consistency Context Compression (SeCo), a method that shifts context compression from position-driven to semantic-driven. Rather than constraint by physical token layout, SeCo dynamically anchors compression directly in the semantic space by selecting query-relevant tokens as semantic centers and aggregating remaining tokens via consistency-weighted merging. This design inherently preserves semantic consistency while eliminating position bias. Extensive experiments on 14 benchmarks across two backbone models demonstrate that SeCo consistently shows superiority in downstream tasks, inference latency, and out-of-domain robustness. The code is available at https://anonymous.4open.science/r/seco-EE5E.
comment: 20 pages, 6 figures
☆ Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent
The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
☆ Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports
Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration and longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
comment: Preprint. Under review at MLHC 2026
☆ PumpSense: Real-Time Detection and Target Extraction of Crypto Pump-and-Dumps on Telegram IEEE
Cryptocurrency pump-and-dump schemes coordinated via Telegram threaten market integrity. However, existing research addressing this specific threat has not yet produced solutions that combine reliable results with fast response. This is in part due to the absence of publicly available, message-level labeled data, as well as design choices. In this paper, we address both issues. In particular, we introduce a corpus of over 280,000 Telegram posts from 39 pump-organizing groups, all manually reviewed to identify 2,246 pump announcements and their targeted cryptocurrency and exchange. Leveraging this dataset, we define two tasks: real-time pump-announcement detection and target cryptocurrency/exchange extraction. For detection, we compare two machine-learning models: a lightweight tree-based LightGBM classifier (F1=0.79, latency=9.4 s/sample) and a transformer-based BGE-M3 (F1=0.83, latency=50 ms/sample). With our proposed approach, we show that message analysis can achieve near-instant pump detection at the level of individual Telegram message windows. Unlike prior work that relies purely on market data and typically detects pumps tens of seconds after abnormal trading activity is observed, our method operates directly on the coordination messages themselves and can be evaluated in microseconds per window on commodity hardware. To our knowledge, we also establish the first benchmark for manipulated coin and exchange extraction. We demonstrate that traditional rule-based extraction methods, widely relied upon in prior literature, are ineffective due to ticker ambiguity. In contrast, LLMs achieve the highest accuracy with a score of 0.91.
comment: Accepted to the 2026 IEEE International Conference on Blockchain and Cryptocurrency (ICBC)
☆ Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs
Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
comment: 17 pages, 5 figures
☆ Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media ACL 2026
Emojis are widely used in online financial communication, but it is unclear whether they provide transferable sentiment signals across languages, platforms, and asset communities. This study examines the extent to which emoji usage, semantics, and sentiment polarity remain stable across financial communities, and how these layers influence zero-shot sentiment transfer. Using large corpora of Twitter and StockTwits posts in four languages, we measure cross-community divergence and evaluate sentiment models trained under emoji-only, text-only, and text+emoji inputs. We find that emoji frequencies differ across communities, especially across languages, but their semantics and sentiment polarity are largely stable. Cross-asset transferability shows minimal degradation, while cross-language transfer remains the most challenging. Including emojis consistently reduces transfer gaps relative to text-only models. These results indicate that financial communication exhibits a partially shared ``emoji code,'' and that emojis provide compact, language-independent sentiment cues that improve model generalization across markets and platforms.
comment: Accepted to Findings of the Association for Computational Linguistics: ACL 2026
☆ Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
☆ EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation
Long-horizon video generation has advanced in visual quality, yet existing methods still struggle to maintain knowledge consistency and coherent pedagogical narratives across multi-shot instructional videos, especially in STEM domains. To address these challenges, we propose EduStory, a unified framework for reliable instructional video generation. EduStory integrates pedagogical state modeling to track persistent knowledge states, script-guided structured control to organize multi-shot narratives, and learning-oriented evaluation metrics to assess knowledge fidelity and constraint satisfaction. To support rigorous evaluation, we further introduce EduVideoBench, a diagnostic benchmark with multi-granularity annotations, including pedagogical storyboards, shot-level semantics, and knowledge state transitions, together with baseline tasks for controllable instructional video generation. Extensive experiments demonstrate that domain-aware state modeling and structured control substantially reduce narrative breakdown and improve alignment with instructional intent. These results highlight the significance of domain-specific structural constraints and tailored benchmarks for advancing reliable, controllable, and also trustworthy long-horizon video generation.
☆ Position: Avoid Overstretching LLMs for every Enterprise Task
Enterprise workloads are dominated by deterministic, structured, and knowledge-dependent tasks operating under strict cost, latency, and reliability constraints. While these are often addressed through large language model (LLM) deployment or distillation into smaller models, we argue this is inefficient, unreliable, and misaligned with enterprise task structures. Instead, AI systems should treat language models as interfaces rather than monolithic engines, externalizing knowledge and computation into dedicated components for greater reliability, scalability, and transparency. Our theoretical evidences show that finite-capacity models cannot fully capture the breadth of knowledge required for enterprise tasks, creating inherent limits to efficiency and interpretability. Building on this, we take the position that language models should primarily be used for structured extraction in deterministic enterprise workflows, while computation and storage are delegated to knowledge bases and symbolic procedures. We formally demonstrate that such modular architectures are more reliable and maintainable than monolithic frameworks, offering a sustainable foundation for enterprise tasks.
☆ Your Simulation Runs but Solves the Wrong Physics: PDE-Grounded Intent Verification for LLM-Generated Multiphysics Simulation Code
Execution-based evaluation of LLM-generated code implicitly treats successful execution as a proxy for correctness. In scientific simulation, this proxy is insufficient: a generated input file can run, mesh, and converge while encoding governing equations that differ from the user's intent. We call this mismatch between intended physics and generated code the comprehension-generation gap. We instantiate this in MOOSE, where Kernel and BC objects map compositionally to weak-form residual terms, enabling deterministic reconstruction of the encoded PDE and comparison against an intended contract. We formalize this comparison as the Intent Fidelity Score (IFS), a structural metric covering governing terms, BCs, ICs, coefficients, and time scheme. Building on IFS, we develop a PDE-grounded refinement loop that uses deterministic violation reports to correct generated code iteratively. We evaluate on MooseBench, a 220-case multiphysics benchmark with PDE-level ground truth released with this work. On this benchmark, our method consistently improves mean IFS over direct generation, with gains concentrated on hard cases. On the subset where direct generation falls below IFS 0.7, refinement adds +0.22 to +0.41 absolute IFS. In the deployment audit, execution-only repair improves execution success while leaving 39-40% of all 220 cases runnable but still solving the wrong physics across the three main deployment-audit models, exposing executability and intent fidelity as separable failure modes. Static proof-of-concept experiments on four PDE-oriented DSLs (UFL/FEniCS, FreeFEM, FiPy, and Devito) suggest that the reconstruction-and-comparison pattern extends beyond MOOSE. These findings reinforce that executable simulation code should be verified against the mathematical structure it is intended to encode, not accepted on execution alone.
comment: Preprint
☆ HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities LREC2026
Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM-based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME-KGQA. This highlights significant challenges that should be addressed for the real-world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home-kgqa
comment: 12 pages, 4 figures, 7 tables, accepted at LREC2026
☆ RuPLaR : Efficient Latent Compression of LLM Reasoning Chains with Rule-Based Priors From Multi-Step to One-Step
The Chain-of-Thought (CoT) paradigm, while enhancing the interpretability of Large Language Models (LLMs), is constrained by the inefficiencies and expressive limits of natural language. Latent Chain-of-Thought (latent CoT) reasoning, which operates in a continuous latent space, offers a promising alternative but faces challenges from structural complexities in existing multi-step or multi-model paradigms, such as error propagation and coordination overhead. In this paper, we introduce One-Model One-Step, a novel compression framework for Latent Reasoning with Rule-Based Priors(RuPLaR) to address this challenge. Our method trains an LLM to autonomously generate latent reasoning tokens in a single training stage, guided by rule-based prior probability distributions, thereby eliminating cascaded processes and inter-model dependencies. To ensure reasoning quality, we design a joint training objective that enforces answer consistency via cross-entropy, aligns soft tokens with rule-based priors via KL divergence (the Soft Thinking constraint), and adds a problem-thought semantic alignment constraint in the representation space. Extensive experiments show that our compression framework not only improves accuracy by 11.1% over existing latent CoT methods but also achieves this with minimal token usage, underscoring its effectiveness and extensibility. Code: https://github.com/xiaocen-luo/RuPLaR.
comment: 15 pages, 15 figures
☆ SkillMAS: Skill Co-Evolution with LLM-based Multi-Agent System
Large language model (LLM) agent systems are increasingly expected to improve after deployment, but existing work often decouples two adaptation targets: skill evolution and multi-agent system (MAS) restructuring. This separation can create organization bottlenecks, context pressure, and mis-specialization. We present SkillMAS, a non-parametric framework for adaptive specialization in multi-agent systems that couples skill evolution with MAS restructuring. SkillMAS uses Utility Learning to assign credit from verified execution traces, bounded skill evolution to refine reusable procedures without unfiltered library growth, and evidence-gated MAS restructuring when retained failures and Executor Utility indicate a structural mismatch. Across embodied manipulation, command-line execution, and retail workflows, SkillMAS is competitive under the reported harnesses while clarifying how post-deployment specialization is attributed, updated, and applied.
comment: 21 pages, 2 figures
☆ Test-Time Speculation
Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, like DFlash, EAGLE-3 and PARD degrade with generation length, reaching values close to 1 (i.e. no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences, but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test-time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as a teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$ and $41\%$ on average, with the benefits scaling with increased generation lengths.
☆ Mem-W: Latent Memory-Native GUI Agents
GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent's continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to $+30.0$, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
☆ Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
Recent advances in LLM agents enable systems that autonomously refine workflows, accumulate reusable skills, self-train their underlying models, and maintain persistent memory. However, we show that such self-evolution is often non-monotonic: adapting to new task distributions can progressively degrade previously acquired capabilities across all major evolution channels. We identify this phenomenon as \emph{capability erosion under self-evolution} and show that it consistently emerges across workflow, skill, model, and memory evolution. To mitigate this issue, we propose \emph{Capability-Preserving Evolution} (CPE), a general stabilization principle that constrains destructive capability drift during continual adaptation. Across all four evolution dimensions, CPE consistently improves retained capability stability while preserving adaptation performance. For example, in workflow evolution, CPE improves retained simple-task performance from 41.8\% to 52.8\% under GPT-5.1 optimization while simultaneously achieving stronger complex-task adaptation. Our findings suggest that stable long-horizon self-evolving agents require not only acquiring new capabilities, but also explicitly preserving previously learned ones during continual adaptation.
☆ LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction
Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons--intermediate representations of query logic--to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for the SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6 execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.
☆ BetaEdit: Null-Space Constrained Sequential Model Editing
Null-space-based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre-existing knowledge representation, thereby preserving the model's original behavior. However, in practice these methods rely on an approximate null space--leading to knowledge leakage--and further suffer from severe performance degradation during sequential editing. Recent work shows that history-aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null-space approaches and then analyze why history-aware updates effectively preserve both editing performance and general capabilities during long-horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history-aware updates into the null-space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive-scale sequential editing. Code is available at: https://github.com/lbq8942/BetaEdit.
☆ A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web WWW2026
The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an ``Agentic Web'' driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.
comment: 5 pages, 2 figures, Accepted at FAAW@WWW2026
☆ Towards Conversational Medical AI with Eyes, Ears and a Voice
The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.
comment: Video examples are available on Youtube: https://youtu.be/y5Vaa_SN1t0, https://youtu.be/dC4icb75vLQ, and https://youtu.be/E7iEvWo-E6c
☆ DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, On VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
☆ Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs ICLR 2026
Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, as to simulate context shifts of different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.
comment: Accepted to the ICBINB Workshop @ ICLR 2026
☆ Reinforcing Multimodal Reasoning Against Visual Degradation
Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.
☆ Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18\% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these ``stumbling blocks'' can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
☆ LLM Agents Already Know When to Call Tools -- Even Without Reasoning
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
☆ Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
Large language models fail at counting repeated tokens despite strong performance on broader reasoning benchmarks. These failures are commonly attributed to limitations in internal count tracking. We show this attribution is wrong. Linear probes on the residual stream decode the correct count with near-perfect accuracy at every post-embedding layer, across all model depths. This holds even at the exact layers where the wrong answer crystallizes while the model simultaneously outputs an incorrect count. Attention patterns show no evidence of collapse over repeated tokens and tokenization artifacts account for none of the failure. Instead, a format-triggered multi-layer perceptron (MLP) block overwrites the correctly-encoded count with a fixed wrong answer at roughly 88--93,% network depth. This prior fires for repeated word-tokens in space-separated list format and is absent for repeated digit-tokens. It is suppressed by comma-separated delimiters in larger models but persists in smaller ones. The finding holds across Llama-3.2 (1B and 3B) and Qwen2.5 (1.5B, 3B and 7B) at consistent relative depth. Counting failure is a failure of routing not of representation and the two require different interventions.
comment: Code is available at https://github.com/sohv/counting-failure
☆ Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
comment: Accepted by NLP4DH 2026
♻ ☆ Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding
Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including annotation and qualitative coding of educational data. While LLM-based multi-agent systems (MAS) can emulate human coding workflows, their benefits over single LLM agents for coding remain poorly understood. To that end, we conducted an experimental study of how persona and temperature of component agents of a MAS shapes consensus-building and coding accuracy for dialog segments. LLMs were prompted to code these segments deductively using a mature codebook with 8 codes and high inter-rater reliability derived from prior research. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions facilitated by educational software. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design and human-AI coding.
comment: Accepted as full paper to the 19th International Conference on Educational Data Mining (EDM 2026)
♻ ☆ Where Do Reasoning Models Refuse? ICML 2025
Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.
comment: v1 accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM). 20 pages, 12 figures v2 submitted to NeurIPS 2026. 31 pages, 16 figures
♻ ☆ Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: \textit{reflection-in-action}, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and \textit{reflection-on-action}, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly-designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with zero-shot generalization to photorealistic HM3D environments and real-robot experiments on a Franka Panda arm. Ablations confirm that reflection-in-action and reflection-on-action are mutually dependent, and that retrospective reflection achieves better credit assignment than step-wise external feedback at lower computational overhead. Qualitative analyses further highlight behavioral correction through reflection.
♻ ☆ LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks. The code is available at: https://github.com/nikhilsab/LLMFE
comment: Accepted in Transactions on Machine Learning Research (TMLR)
♻ ☆ Positional Encoding via Token-Aware Phase Attention
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substantially lower perplexity and stronger retrieval performance in the long-context regime than RoPE-style baselines.
comment: 28 pages
♻ ☆ LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation ICLR 2026
Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood problems are solvable with via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original question as performance markedly drop upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.
comment: Published as a conference paper at ICLR 2026
♻ ☆ Efficient Estimation of Kernel Surrogate Models for Task Attribution ICLR 2026
Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is how to quantify the influence of each individual training task on performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict the performance on a target task for any subset of training tasks has emerged in the recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships but miss nonlinear interactions such as XOR-type effects. In this paper, we first consider a unified task-weighting framework for analyzing task-attribution methods and establish a new connection between linear surrogate models and influence functions via a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate surrogate estimates with less than $2\%$ relative error without repeated retraining. Experiments across multiple settings -- including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a $25\%$ higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines, enabling more accurate and scalable task attribution. When used for downstream data selection, kernel surrogate models further yield a $40\%$ improvement in the aforementioned settings.
comment: 27 pages. Appeared in ICLR 2026
♻ ☆ AgentReview: Exploring Peer Review Dynamics with LLM Agents EMNLP 2024
Peer review is fundamental to the integrity and advancement of scientific publication. Traditional methods of peer review analyses often rely on exploration and statistics of existing peer review data, which do not adequately address the multivariate nature of the process, account for the latent variables, and are further constrained by privacy concerns due to the sensitive nature of the data. We introduce AgentReview, the first large language model (LLM) based peer review simulation framework, which effectively disentangles the impacts of multiple latent factors and addresses the privacy issue. Our study reveals significant insights, including a notable 37.1% variation in paper decisions due to reviewers' biases, supported by sociological theories such as the social influence theory, altruism fatigue, and authority bias. We believe that this study could offer valuable insights to improve the design of peer review mechanisms. Our code is available at https://github.com/Ahren09/AgentReview.
comment: Accepted at EMNLP 2024 Main Track (Oral). https://agentreview.github.io/
♻ ☆ No Mean Feat: Simple, Strong Baselines for Context Compression
Context compression reduces Transformer inference costs by replacing lengthy inputs with shorter pre-computed representations. It carries significant benefits for retrieval-augmented generation (RAG) and has attracted growing research attention. However, progress remains difficult to measure due to inconsistent evaluations and baselines. We design a standard, easy-to-reproduce evaluation suite for context compression, BenchPress, along with simple, high-performance baselines for English reading comprehension. BenchPress supports benchmarking across model scales, datasets, compression ratios, and short ($<$1K tokens) to mid-range ($<$8K tokens) contexts. While the suite is applicable to any compression paradigm, our baselines target soft context compression. We establish two simple baselines that strongly outperform the widely used causal compression-token approach: mean pooling and a bidirectional compression-token variant. Our results show the benefit of bidirectional attention when computing compressed representations, and that simple pooling is an expressive compression operator.
comment: Code available at https://github.com/lil-lab/benchpress
♻ ☆ Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction ACL2026
Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
comment: ACL2026, camera-ready
♻ ☆ Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation ACL 2026
Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. Multiple generations also allow us to define $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantics of prompts, enabling error detection and quality control in benchmark construction.
comment: 11 pages, 5 figures, accepted at the Findings of ACL 2026
♻ ☆ Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that these shortcomings are fundamental to the autoregressive transformer architecture. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
♻ ☆ Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark NeurIPS 2025
Multi-modal Large Language Models (MLLMs) exhibit impressive problem-solving abilities in various domains, but their visual comprehension and abstract reasoning skills remain under-evaluated. To this end, we present PolyMATH, a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning. We conducted a comprehensive, and quantitative evaluation of 15 MLLMs using four diverse prompting strategies, including Chain-of-Thought and Step-Back. The best scores achieved on PolyMATH are ~41%, ~36%, and ~27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively - highlighting the logical and visual complexity of these questions. A further fine-grained error analysis reveals that these models struggle to understand spatial relations and perform drawn-out, high-level reasoning. This is further strengthened by our ablation study estimating MLLM performance when given textual descriptions in place of diagrams. As evidenced by ~4% improvement over textual descriptions as opposed to actual images, we discover that models do not truly comprehend visual diagrams and the spatial information therein, and are thus prone to logical errors. Finally, we evaluate the OpenAI o1 models and find that their performance only matches the human baseline, highlighting the difficulty of the benchmark. The results on PolyMATH highlight the room for improvement in multi-modal reasoning and provide unique insights to guide the development of future MLLMs.
comment: Accepted in Neural Information Processing Systems (NeurIPS 2025) Workshop: Foundations of Reasoning in Language Models
♻ ☆ Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness NeurIPS
As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work results in several important findings that provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
comment: Accepted in 2025 NeurIPS Foundations of Reasoning in Language Models Workshop
♻ ☆ Virtual Personas for Language Models via an Anthology of Backstories EMNLP 2024
Large language models (LLMs) are trained from vast repositories of text authored by millions of distinct authors, reflecting an enormous diversity of human traits. While these models bear the potential to be used as approximations of human subjects in behavioral studies, prior efforts have been limited in steering model responses to match individual human users. In this work, we introduce "Anthology", a method for conditioning LLMs to particular virtual personas by harnessing open-ended life narratives, which we refer to as "backstories." We show that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-populations. Across three nationally representative human surveys conducted as part of Pew Research Center's American Trends Panel (ATP), we demonstrate that Anthology achieves up to 18% improvement in matching the response distributions of human respondents and 27% improvement in consistency metrics.
comment: EMNLP 2024 Main
♻ ☆ Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders, which require full corpus access and expensive index construction. We propose generating synthetic hard negatives directly from a provided query and positive passage, using Large Language Models(LLMs). We fine-tune DistilBERT using synthetic negatives generated by four state-of-the-art LLMs ranging from 4B to 30B parameters (Qwen3, LLaMA3, Phi4) and evaluate performance across 10 BEIR benchmark datasets. Contrary to the prevailing assumption that stronger generative models yield better synthetic data, find that our generative pipeline consistently underperforms traditional corpus-based mining strategies (BM25 and Cross-Encoder). Furthermore, we observe that scaling the generator model does not monotonically improve retrieval performance and find that the 14B parameter model outperforms the 30B model and in some settings it is the worst performing.
♻ ☆ Code Mixologist : A Practitioner's Guide to Building Code-Mixed LLMs
Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
comment: 8 pages main paper, 13 pages total
♻ ☆ Verbalized Algorithms: Classical Algorithms are All You Need (Mostly) NeurIPS 2025
Reasoning is a fundamentally algorithmic task. Yet current work on LLM-based reasoning relies on free-form generation whose theoretical guarantees (soundness, completeness, complexity, optimality) remain poorly understood. We argue that we should not treat them as general-purpose reasoners, and as an alternative, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs and various algorithms with established guarantees. Instead of betting on LLM's ability to solve a reasoning task, VAs limit their scope by decomposing the task down to simple elementary operations on strings that they can answer reliably. For example, sorting a list of natural language strings could be done by using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm. We push the accuracy-runtime Pareto front with \emph{verbalized maximum}, \emph{sorting}, \emph{clustering}, and \emph{submodular maximization}, for numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q\&A RAG task. These results suggest improving LLM-based reasoning through standard algorithmic analysis is a feasible and better grounded research direction.
comment: Accepted in NeurIPS 2025 Workshop on Efficient Reasoning; Submitted to Position Paper Track at Neurips 2026
♻ ☆ QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQAUSMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. Besides, we also proposed an effect data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86. 27% while using only 3.9% of the data.This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLM in resource-limited medical settings.
comment: Accepted by ICIC 2026 Poster
♻ ☆ Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation
Large Language Models (LLMs) face the "knowledge cutoff" challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model's ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.
♻ ☆ AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering ICASSP 2026
Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.
comment: Accepted to ICASSP 2026 (Oral). Project Website: https://github.com/kuan2jiu99/aqua-bench
♻ ☆ First, Do No Harm: AI Supervisor Scaffolds Novice Growth in Counselor Education
The most dangerous mistakes a novice counselor makes are not the obvious ones: they are utterances that sound caring while quietly violating professional ethics and leaving vulnerable clients less protected. We build an AI supervisor that does not replace novice counselors, but grows them-teaching them to internalize ethical violations they would otherwise never notice. What makes this supervisor non-trivial is not detection but teaching: it must locate the ethical-violating utterance, diagnose the ethical violation against APA principles, and deliver feedback that explains not just what went wrong, but why it is risky and how to respond differently. The core obstacle is that (1) ethical violations are by nature unlabeled in real clinical data, and (2) existing AI counselors trained only to match correct answers will never learn to teach. We resolve both at once: a controllable AI novice that intentionally enacts predefined mistake categories makes supervision labels a natural byproduct of generation, yielding ETHICSCAFF, a 9,915-instance human-in-the-loop dataset; and GRPO under a Novice Growth Reward (NGR) optimizes the supervisor not for answer correctness but for whether a weaker novice model actually improves after reading its explanation. Experiments show that a novice guided by our supervisor outperforms an unguided peer on clinical metrics, and that teaching-oriented optimization via NGR further sharpens the supervisor's own ethical detection. In a user study with novice counseling-psychology students, participants show significant self-efficacy gains across all eight assessed competencies after receiving AI supervisory feedback, demonstrating that the scaffold transfers from simulation to real-world practice.
comment: 9 pages, 5 figures
♻ ☆ EMO: Pretraining Mixture of Experts for Emergent Modularity
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity-the independent use and composition of expert subsets-without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
♻ ☆ Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries
News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing bias in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances ($κ= 0.61$) to validate and calibrate model-based annotations. Using FIFO, we analyze framing behavior across 27 summarization models. We find that LLMs systematically exhibit higher framing rates than human journalists, with strong variation across topics and training regimes, including elevated framing in scientific and public health summaries. Our results establish framing as a missing yet consequential dimension of summarization quality.
♻ ☆ Mitigating Misalignment Contagion by Steering with Implicit Traits
Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.
♻ ☆ SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? ICLR 2026
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
comment: Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables
♻ ☆ Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models
Real-time Spoken Language Models (SLMs) struggle to leverage Chain-of-Thought (CoT) reasoning due to the prohibitive latency of generating the entire thought process sequentially. Enabling SLMs to think while speaking, similar to humans, is attracting increasing attention. We present, for the first time, Mind-Paced Speaking (MPS), a brain-inspired framework that enables high-fidelity, real-time reasoning. Similar to how humans utilize distinct brain regions for thinking and responding, we propose a novel dual-brain approach, employing a "Formulation Brain" for high-level reasoning to pace and guide a separate "Articulation Brain" for fluent speech generation. This division of labor eliminates mode-switching, preserving the integrity of the reasoning process. Experiments show that MPS significantly outperforms existing think-while-speaking methods and achieves reasoning performance comparable to models that pre-compute the full CoT before speaking, while drastically reducing latency. Under a zero-latency configuration, the proposed method achieves an accuracy of 92.8% on the mathematical reasoning task Spoken-MQA and attains a score of 82.5 on the speech conversation task URO-Bench. MPS is the methodology underlying our released Step-Audio R1.1 system, effectively bridging the gap between high-quality reasoning and real-time interaction.
♻ ☆ AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning ACL 2026
Evaluating LLM agent trajectories is fundamentally task-specific: a code-debugging agent should be judged on Correctness and Error Handling, not on Fluency or Safety. Yet the dominant paradigm -- LLM-as-Judge with a fixed rubric -- applies the same static dimensions regardless of task, producing systematic mis-evaluation. We present AdaRubric, a framework that (i) adaptively generates task-specific evaluation rubrics from task descriptions via LLM, (ii) evaluates agent trajectories step-by-step with confidence-weighted, per-dimension scoring, and (iii) produces dense reward signals for preference learning. Three composable filtering strategies, including the novel DimensionAwareFilter that provably prevents dimension-level quality masking, yield high-quality DPO preference pairs. On WebArena, ToolBench, and AgentBench, AdaRubric achieves Pearson r = 0.79 human correlation (+0.15 over the strongest baseline), with strong reliability (Krippendorff's alpha = 0.83). DPO models trained on AdaRubric-generated pairs improve task success by +6.8-8.5% over the best baseline. AdaRubric also generalises zero-shot to unseen domains (SWE-bench) and extends to multimodal agents (VisualWebArena, OSWorld) without modification. Our code is available at: github.com/alphadl/AdaRubrics
comment: KnowFM @ ACL 2026
♻ ☆ AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
LLM-agent training pipelines routinely discard failed trajectories even though GPT-4o achieves only 14-20% on WebArena and below 55% pass@1 on ToolBench; even specialised systems at 50-65% leave the majority of trajectories unused. We introduce AgentHER, which recovers this lost signal by adapting Hindsight Experience Replay (HER) to natural-language agent trajectories: a trajectory that fails goal A is often a correct demonstration for an achievable alternative goal B. AgentHER realises this through a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) that converts discarded failures into SFT, DPO, and ShareGPT training data. On WebArena and ToolBench under a strict task-disjoint held-out protocol, AgentHER improves over success-only SFT by +7.6-11.4% across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), achieves 2x sample efficiency, and beats the strongest experience-centric baseline (Agent Workflow Memory) by +3.0-6.2%. Two robustness mechanisms, failure-severity weighting and cross-model multi-judge verification (gpt-4o-mini paired with Qwen2.5-72B-Instruct), reduce label noise from 5.9% to 2.9% and raise human-rated relabeling precision to 97.1% on WebArena and 96.0% on ToolBench. A full system-cost audit shows the entire relabeling pipeline costs 2.98 and 26 wall-clock minutes for 3,000 trajectories, i.e. 1.4 x 10^-3 per accepted pair. Code: https://github.com/alphadl/AgentHER
♻ ☆ Data Mixing Can Induce Phase Transitions in Knowledge Acquisition NeurIPS'25
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We attribute these phase transitions to a capacity allocation phenomenon: a model with bounded capacity must act like a knapsack problem solver to minimize the overall test loss, and the optimal allocation across datasets can change discontinuously as the model size or mixing ratio varies. We formalize this intuition in an information-theoretic framework and reveal that these phase transitions are predictable, with the critical mixing ratio following a power-law relationship with the model size. Our findings highlight a concrete case where a good mixing recipe for large models may not be optimal for small models, and vice versa.
comment: NeurIPS'25 Spotlight
♻ ☆ Model-Aware Tokenizer Transfer
Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
♻ ☆ LILO: Bayesian Optimization with Natural Language Feedback
Many real-world optimization problems are guided by complex, subjective preferences that are difficult to express as explicit closed-form objectives. In response, we introduce Language-in-the-Loop Optimization (LILO), a Bayesian optimization (BO) framework that employs a large language model (LLM) to translate free-form natural language feedback and prior knowledge from a decision maker into structured preference signals, going beyond the restrictive scalar or pairwise feedback formats typically assumed in preferential BO. The LLM-derived preferences are integrated by a Gaussian process proxy model, enabling principled acquisition-driven exploration with calibrated uncertainty. By placing the LLM in a supporting role rather than as the optimizer itself, LILO preserves the sample efficiency and stability of BO while providing a flexible and expressive feedback interface. Across synthetic and real-world benchmarks, LILO consistently outperforms both conventional preference-based BO methods and LLM-only optimizers, with particularly strong gains in feedback-limited regimes.
♻ ☆ TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination ACL 2026
Large Language Models (LLMs) typically come with a fixed architecture, despite growing evidence that not all layers contribute equally to every downstream task. We introduce TALE (Task-Aware Layer Elimination), an inference-time method that improves task performance by selectively removing layers that are irrelevant or detrimental for a given task. TALE optimizes task-specific performance, yielding a task-optimized architecture without retraining. Across 9 tasks and 5 model families, under both zero-shot and few-shot settings, TALE consistently matches or surpasses baseline performance while simultaneously reducing computational costs. TALE also synergizes with fine-tuning, leading to further performance improvements. Computing TALE for a new task requires modest resources, making it a practical and deployable solution for task-specialized LLM inference.
comment: ACL 2026 Findings
♻ ☆ TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation -- with its distinctive challenges and opportunities -- remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.
comment: 9 pages. Preprint to be submitted to peer-review
♻ ☆ When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
♻ ☆ Fast-MIA: Efficient and Scalable Membership Inference for LLMs ACL 2026
We propose Fast-MIA (https://github.com/Nikkei/fast-mia), a Python library for efficiently evaluating membership inference attacks (MIA) against large language models (LLMs). MIA has emerged as a crucial technique for auditing privacy risks and copyright infringement in LLMs. However, computational demands have grown substantially: recent methods rely on repeated inference, while practical auditing requires large-scale evaluation. Progress is further hindered by existing implementations that execute methods independently, redundantly computing shared intermediate results such as log-probabilities. To address these challenges, Fast-MIA combines two strategies: (1) high-throughput batch inference via vLLM, achieving approximately 5$\times$ speedup, and (2) a cross-method caching architecture that computes intermediate results once and shares them across methods. The library includes representative MIA methods under a unified framework, integrates with established benchmarks, and supports flexible YAML configuration. We release Fast-MIA under the Apache License 2.0 to support scalable and reproducible MIA research.
comment: ACL 2026 System Demonstrations
♻ ☆ MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce $\textit{MapFormers}$, new Transformer-based architectures, which can learn cognitive maps from observational data and perform path-integration without supervision. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved by updating position encodings with input-dependent matrices, built as exponentials of learned combinations of Lie-algebra generators. We developed two variants of $\textit{MapFormers}$ that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested $\textit{MapFormers}$ on several formal tasks targeting distinct cognitive capacities, including gating, 2D navigation and nested hierarchies (Dyck Languages). Our results demonstrate that $\textit{MapFormers}$ significantly outperform current AI architectures, achieving near-perfect OOD generalization where standard models fail. Furthermore, we show that $\textit{MapFormers}$ are scalable; evaluations on naturalistic data yield perplexity improvements over baselines, suggesting that these principles extend to large-scale, real-world domains. These results are obtained through efficient parallel computation on commutative maps, though our models can also learn non-commutative cognitive maps via sequential path-integration. Overall, these results suggest that input-dependent matrices provide a critical structural bias, by disentangling abstract relations from content in order to drive robust OOD generalization.
comment: 19 pages (29 with appendix), 8 figures
♻ ☆ Benchmarking Real-Time Question Answering via Executable Code Workflows
Retrieving real-time information is a fundamental capability for search-integrated agents in real-world applications. However, existing benchmarks are predominantly static and therefore fail to capture the temporal dynamics of information and the continuously evolving nature of real-world knowledge. To address this limitation, we propose RT-QA, a dynamic evaluation framework that leverages executable code workflows to retrieve up-to-date answers at evaluation time. Specifically, we construct an agent-driven pipeline that autonomously generates code for web crawling and DOM-based answer extraction to produce real-time ground truth. To ensure robust evaluation over time, the pipeline further incorporates a self-repair mechanism to adapt to changes in web page structures. RT-QA spans 12 domains (e.g., Finance, Sports) with 320 Chinese questions categorized into three difficulty levels. Extensive evaluations of state-of-the-art models (e.g., GPT-5.2, GLM-4.7) reveal significant limitations in real-time adaptability: even the best models achieve only 46% accuracy. Our analysis highlights two primary failure modes: (1) Lazy Retrieval, where agents rely on search snippets instead of deeply scanning specific websites for information (20% of failures); and (2) Temporal Confusion, a cognitive error where agents retrieve a historical date (e.g., an event in 2024) and fail to re-anchor to the current time (2026) for subsequent reasoning. These findings suggest that future agents require not just better retrieval strategies, but robust temporal state management.
♻ ☆ UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection
Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing prompt discovery as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on ground-truth (GT) rewards. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and position-debiased pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization can remain highly effective even in unsupervised settings.
♻ ☆ LLM-Augmented Chemical Synthesis and Design Decision Programs ICML 2025
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
comment: ICML 2025
♻ ☆ Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective
As large language models (LLMs) are employed worldwide, existing evaluation paradigms for their multilingual capabilities primarily focus on factual task performance, neglecting the ability to judge content's deep-level values across multiple languages. To bridge this gap, we first reveal two primary challenges in constructing values judgment benchmarks, cultural diversity and disciplinary complexity, and propose a novel two-stage human-AI collaborative annotation framework to alleviate them. This framework identifies the issue scope and nature, establishes specific annotation criteria, and utilizes multiple LLMs for final review. Building upon this framework, we introduce \textbf{X-Value}, the first \textit{Cross-lingual Values Judgment Benchmark} designed to evaluate the capability of LLMs in judging deep-level values of content. X-Value comprises 4,750 Question-Answer pairs across 14 languages, covering 7 major global issue categories, and provides 12 granular annotation metadata to facilitate a rigorous evaluation of model performance. Systematic evaluations of X-Value are conducted across 17 LLMs using distinct prompting strategies. Multi-dimensional analysis of accuracy and F1-scores reveals their limitations in cross-lingual values judgment and indicates performance disparities across categories and languages. This work highlights the urgent need to improve the underlying, values-aware content judgment capability of LLMs.\footnote{Samples of X-Value are available at https://huggingface.co/datasets/Whitolf/X-Value.}
♻ ☆ SSA: Improving Performance With a Better Scoring Function ACL 2026
While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose \textbf{Scaled Signed Averaging (SSA)}, a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.
comment: ACL 2026 Main Conference
♻ ☆ MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
Large Language Models (LLMs) often struggle with the precise logic and schema alignment required for complex Text-to-SQL tasks. While current methods rely heavily on static prompting, they lack the ability to dynamically adapt and self-correct through environmental interaction. To bridge this gap, we propose MARS-SQL, a trainable multi-agent framework for Text-to-SQL. Rather than introducing a new standalone SQL primitive, MARS-SQL makes an agentic workflow trainable by decomposing the problem into three specialized roles: schema grounding, query generation, and solution validation. Central to our approach is a generation agent trained via a multi-turn RL policy within a ReAct-style loop. The agent learns to iteratively reason, execute intermediate SQL actions on a live database, and refine its strategy based on execution feedback. To improve robustness, we further introduce a validation mechanism that treats solution selection as a generative modeling task, identifying the optimal interaction trajectory through next-token prediction probabilities. Empirical evaluations demonstrate the effectiveness of coupling interactive learning with trajectory ranking. MARS-SQL achieves state-of-the-art performance, recording an execution accuracy of 77.84% on the BIRD development dataset and 89.75% on the Spider test dataset, while also transferring strongly to out-of-domain benchmarks. Code is available at https://github.com/YangHaolin0526/MARS-SQL.
♻ ☆ Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-\$p\$ (nucleus) sampling, and min-\$p\$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-\$p\$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-\$p\$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
♻ ☆ Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
Reinforcement learning (RL) has enabled complex reasoning abilities in large language models (LLMs). However, most RL algorithms suffer from performance saturation, preventing continued gains as RL training scales. This problem can be characterized by the collapse of entropy, a key diagnostic for exploration in RL. Existing attempts focus on preventing entropy collapse through regularization or clipping. However, their resulting entropy curves often exhibit instability in the long term, which hinders performance gains. In this paper, we introduce Entrocraft, a simple rejection-sampling approach that realizes user-customized entropy schedule by biasing the advantage distributions. Entrocraft requires no objective regularization and is advantage-estimator-agnostic. Theoretically, we relate per-step entropy change to the advantage distribution under minimal assumptions. This explains the behavior of existing RL and entropy-preserving methods. Entrocraft also enables a systematic study of entropy schedules, which reveals that linear annealing, which starts high and decays to a slightly lower target, performs best. Empirically, Entrocraft addresses performance saturation, significantly improving generalization, output diversity, and long-term training. It enables a 4B model to outperform an 8B baseline, sustains improvement for up to 4x longer before plateauing, and raises pass@K by 50% over the baseline.
Multimedia 7
☆ KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body--hand--face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov--Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic time warping based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
comment: Accepted at Neurocomputing
☆ ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality
We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate--task trade-offs across multiple downstream tasks.
☆ Mitigating Multimodal Inconsistency via Cognitive Dual-Pathway Reasoning for Intent Recognition ICMR 2026
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cues, and (2) ineffectively modeling multimodal conflicts, leading to semantic cancellation. To address these, we propose a novel Cognitive Dual-Pathway Reasoning (CDPR) framework, which constructs a stable semantic foundation via the intuition pathway and mitigates high-level semantic conflicts through the reasoning pathway, cooperatively establishing deep semantic relations. Specifically, we first employ a representation disentanglement strategy to extract modality-invariant and specific features. Subsequently, the intuition pathway aggregates cross-modal consensus using shared features for solid global representations. The reasoning pathway introduces an inconsistency perception mechanism, combining semantic prototype matching with statistical probability calibration to precisely quantify conflict severity, and dynamically adjusting the weights between both pathways. Furthermore, a multi-view loss function is adopted to alleviate modality laziness and learn structured features at different stages. Extensive experiments on two benchmarks show that CDPR achieves SOTA performance and superior robustness in mitigating multimodal inconsistency. The code is available at https://github.com/Hebust-NLP/CDPR.
comment: Accepted by ICMR 2026 (Main Track, Long Paper)
☆ Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery ICMR 2026
In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.
comment: Accepted by ICMR 2026. Generalized category discovery, semi-supervised learning, contrastive learning
☆ Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.
comment: 18 pages, 12 figures, 6 tables. Preprint
☆ HOME-KGQA: A Benchmark Dataset for Multimodal Knowledge Graph Question Answering on Household Daily Activities LREC2026
Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit and structured knowledge. Integrating these two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means to reduce LLM hallucinations and to leverage knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lack fine-grained spatiotemporal data, which limits their applicability to real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with graph database query languages. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that the LLM-based KGQA methods fail to achieve performance comparable to that on existing datasets when evaluated on HOME-KGQA. This highlights significant challenges that should be addressed for the real-world deployment of KGQA systems. Our dataset is available at https://github.com/aistairc/home-kgqa
comment: 12 pages, 4 figures, 7 tables, accepted at LREC2026
☆ CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting SIGGRAPH 2026
Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5$\sim$20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.
comment: SIGGRAPH 2026 Conference Paper. Code is available at https://github.com/yindaheng98/ColorAdaptiveGaussianSplatting
Computation and Language 28
☆ Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
[Abridged] Using a Large Language Model (LLM) as an automatic rater (LLM-as-a-judge) is cheap but potentially biased: some judges run lenient, others strict, the middle of the scale gets compressed, and verbose answers may be over-rewarded. A common remedy is post-hoc calibration: leave the cheap judge in place and, on a modest set of paired anchors, fit a transformation from raw judge scores to an estimate of the human rating. We compare two correctors that take opposing views on how this mapping should be modeled: a parametric, small-anchor hierarchical Bayesian linear correction with per-score uncertainty, and a non-parametric Neural-ODE (FFJORD) score-transport flow. Both are run head-to-head on UltraFeedback fine-grained_score (1700 paired examples, 200 held out), with calibration split into three operational sub-questions: population-mean recovery, per-item accuracy, and distributional-shape match. The headline result is that the choice between methods is primarily a data-budget question. Both correctors close the raw $+0.71$-point mean offset to within $\pm 0.08$ of the GPT-4 reference, at 100 and at 1500 anchors. Past that, the methods swap roles. With 100 anchors, the linear corrector reconstructs the human-score distribution roughly twice as well by KL divergence (0.031 vs. 0.058) and ties the flow on MAE. With 1500 anchors the flow wins on every metric (MAE 0.320 vs. 0.359, Pearson 0.922 vs. 0.896, KL 0.026 vs. 0.037). The Bayesian linear corrector saturates well below 1500 anchors: residual $\tanh$-shaped non-linearity is, by construction, structure a linear correction cannot fit. The flow keeps improving as labels grow. We translate these findings into an explicit decision rule for production deployments.
☆ Emergent Semantic Role Understanding in Language Models
Understanding how linguistic structure emerges in language models is central to interpreting what these systems learn from data and how much supervision they truly require. In particular, semantic role understanding ("who did what to whom") is a core component of meaning representation, yet it remains unclear whether it arises from pre-training alone or depends on task-specific fine-tuning. We study whether semantic role understanding emerges during language model pre-training or requires task-specific fine-tuning. We freeze decoder-only transformers and train linear probes to extract semantic roles, using performance to infer whether role information is already encoded in pre-training or learned during adaptation. Across model scales, we find that frozen representations contain substantial semantic role information, with performance improving but not fully matching fine-tuned models. This indicates partial but incomplete emergence from pre-training alone. We show that semantic role structure emerges from language modeling objectives, but its internal implementation shifts toward more distributed representations as model scale increases.
☆ Agentic MIP Research: Accelerated Constraint Handler Generation
Mixed-integer programming (MIP) research is both mathematically sophisticated and engineering-intensive: testing an algorithmic hypothesis within a branch-and-cut solver requires substantial implementation, debugging, tuning, and large-scale benchmarking. We propose an agentic MIP research framework that shortens this feedback loop by embedding LLM agents into a solver-aware harness for generating, verifying, and evaluating plugins for the open-source solver SCIP. Propagation methods play a central role in accelerating MIP solving by exploiting global constraints. We instantiate our framework on the semantic lifting of MIP formulations into global constraints and the automatic construction of propagation-only SCIP constraint handlers. On the MIPLIB 2017 benchmark set, the framework successfully recovers global constraint structures from constraint programming and generates executable constraint detectors and propagation-only constraint handlers. Furthermore, the framework naturally extends to in-context learning within a sandboxed environment, enabling agents not only to tune and debug generated constraint handlers on real instances, but also to explore global constraint patterns in MIP problems and discover novel propagation strategies not yet implemented in SCIP. This framework allows us to systematically distinguish meaningful algorithmic improvements from low-value or overly costly candidates: the novel propagation methods successfully solved five additional instances within the explored benchmark. Overall, this framework demonstrates that LLM agents can autonomously navigate the complex MIP research loop, paving the way for a more automated solver development process.
☆ Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
comment: 10 pages, 6 tables. Code: https://github.com/fabio-rovai/open-ontologies
☆ WorldSpeech: A Multilingual Speech Corpus from Around the World
Automatic speech recognition (ASR) performs well for high-resource languages with abundant paired audio-transcript data, but its accuracy degrades sharply for most languages due to limited publicly available aligned data. To this end, we introduce WorldSpeech, a 24 kHz multilingual speech corpus comprising 65k hours of aligned audio-transcript data across 76 languages, collected from diverse public sources including parliamentary proceedings, international broadcasts, and public-domain audiobooks. For 37 languages, WorldSpeech provides more than 200 hours of aligned speech, with 28 exceeding 500 hours and 24 surpassing 1k hours. Fine-tuning existing ASR models on WorldSpeech results in an average relative Word-Error-Rate reduction of 63.5% across 11 typologically diverse languages.
☆ Sparse Layers are Critical to Scaling Looped Language Models
Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior exit points, as confirmed by earlier output convergence at these points. In sum, we provide a clear direction for scaling looped models: a Looped-MoE model with early exits can not only beat standard transformers at scale, but also enable significant memory and inference savings with minimal degradation in quality.
☆ Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan ACL 2026
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.
comment: Accepted at NLP4DH @ ACL 2026
☆ Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Deciphering animal intent is a fundamental challenge in computational ethology, largely because of semantic aliasing, the phenomenon where identical external signals (e.g., a cat's purr) correspond to radically different internal states depending on physiological context. Existing Multimodal Large Language Models (MLLMs) are blind to high-frequency biological time-series data, restricting them to superficial behavioural pattern matching rather than genuine latent-state reasoning. To bridge this gap, we introduce Meow-Omni 1, the first open-source, quad-modal MLLM purpose-built for computational ethology. It natively fuses video, audio, and physiological time-series streams with textual reasoning. Through targeted architectural adaptation, we integrate specialized scientific encoders into a unified backbone and formalize intent inference via physiologically grounded cross-modal alignment. Evaluated on MeowBench, a novel, expert-verified quad-modal benchmark, Meow-Omni 1 achieves state-of-the-art intent-recognition accuracy (71.16%), substantially outperforming leading vision-language and omni-modal baselines. We release the complete open-source pipeline including model weights, training framework, and the Meow-10K dataset, to establish a scalable paradigm for inter-species intent understanding and to advance foundation models toward real-world veterinary diagnostics and wildlife conservation.
☆ From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages ACL 2026
Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and provide practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.
comment: Accepted at NLP4DH @ ACL 2026
☆ A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
Agents built on large language models (LLMs) rely on a range of reliability techniques, including retry, majority voting, and self-consistency, that have been developed in parallel rather than within a common analytical framework. We observe that an LLM sampled at temperature $T$ is a discrete stochastic channel $p(y \mid x)$ in the sense of Shannon's coding theory, and use this identity as the entry point for such a framework grounded in communication theory. Each of these techniques is a special case of one of six classical reliability operators: diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing. Within the framework we give two closed-form results: a noise-variance threshold above which uniform averaging beats quality-weighted averaging, and a contractivity criterion for generator-critic refinement, consistent with a contractive-to-divergent transition we observe between 3B- and 14B-parameter models. We further introduce a cost-aware semantic-nearest-neighbor router whose single Lagrangian knob traverses the quality-cost frontier without retraining. Across six channel configurations spanning local and cloud models on 69 hard tasks, no fixed model-technique-budget choice dominates, motivating per-task allocation. On a 300-item hard split of MMLU, GSM8K, and HumanEval, our router occupies the full empirical Pareto frontier: at matched quality, its normalized cost is ${\approx}56$\% lower than the strongest fixed technique; at matched normalized cost, it improves quality by ${\approx}7$\% ($26$\% over single-shot decoding). These results argue for consolidating these reliability techniques into a single tunable layer informed by channel coding.
☆ Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity
Personalized alignment aims to adapt large language models to heterogeneous user preferences, yet the precise theoretical conditions for its statistical efficiency have not been formally established. This paper characterizes the conditions under which personalized alignment achieves O(1) online regret and log(1/epsilon) offline sample complexity. We show that these optimal rates depend on a specific user-diversity condition: the population of user-specific heads must span the latent reward directions that can alter the optimal response. We prove that this condition is both necessary and sufficient. When it holds, simple greedy algorithms achieve benchmark efficiency; when it fails, every learner in a natural admissible class incurs at least logarithmic regret. Our results identify user diversity as the fundamental driver of personalized identifiability.
☆ Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain ACL 2026
Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small sample and fail to demonstrate LLM susceptibility to context with potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possible human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports, including firm aspects summarized and analyzed by sophisticated analysts with investment ratings (Bullish/Neutral/Bearish) spanning from various industries. We present large language models with firm analyst reports with/without analyst investment ratings and even with 'fake' rating, to get investment ratings generated by LLMs. Our results reveal that LLMs tend to herd the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently, some models even exceed human performance in predicting future stock return.
comment: ACL 2026 Findings
☆ GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.
☆ Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation ACL
We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.
comment: 5 pages, ACL SRW 2026
☆ Character-Level Transformer for Tajik-Persian Transliteration with a Parallel Lexical Corpus
This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik--Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik--Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The Transformer achieves a CER of 0.3216 and an exact-match accuracy of 0.3133, outperforming both dictionary-based rule-based and recurrent neural baselines. With beam search (k=3), performance further improves to CER 0.3182 and accuracy 0.3215. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All preprocessing scripts, deterministic splits into training, validation, and test sets, and training configurations are released to support reproducibility and further research on Tajik and related Persian dialects. The corpus supports research in character-level transliteration, cross-script NLP, and lexicographic applications.
comment: Published in Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (AbjadNLP), pages 75-83, Rabat, Morocco, March 2026
☆ Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.
comment: Under review, For questions or model-evaluation requests, contact guijin.son@snu.ac.kr
☆ Language-Conditioned Visual Grounding with CLIP Multilingual
Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon HR>LR p<10^-300; cluster-mask IoU gap +0.114 at base, +0.143 at large), isolating the deficit to the text branch. Second, scaling the encoder 7x widens the gap for structural failure cases (Basque Δ=-0.056, Luxembourgish Δ=-0.076) while improving Arabic (Δ=+0.033), separating corpus-coverage from tokeniser-fertility failures. Third, peak similarity is preserved across languages (mean ratio 0.94 at large scale) while cluster-mask IoU drops sharply, identifying spatial misalignment, not signal collapse, as the dominant failure mode. At 3.4-3.9 Wh per 1,000 queries, dense-CLIP grounding is competitive with high-throughput inference budgets, positioning it as a practical substrate for energy-aware multilingual deployment.
♻ ☆ CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from assertive rejection like "None of the Above" to uncertainty admission like "I don't know" (IDK). Notably, just including IDK in the answer space increases incorrect answer selections. Lastly, we formalize the performance gap between identifying the correct answer and abstaining from incorrect ones as the humility deficit, which worsens with model scale. Our findings reveal limitations in standard medical benchmarks and underscore that scaling alone does not resolve LLM reliability issues.
♻ ☆ ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention
Large Reasoning Models (LRMs) often reach a correct solution before their long Chain-of-Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive-to-redundant transition and show that it is directly reflected in hidden states: around first-correct-solution (FCS) boundaries, late-layer representations separate efficient from overthinking tokens, while boundary-permutation and position-control baselines collapse. Based on this signal, we propose ROM, a model-agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden-state detector and intervenes at well-formed reasoning boundaries. Counterfactual Self-Correction (CSC) augments supervision with balanced wrong to correct trajectories, preserving useful pre-FCS correction while labeling only post-FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU-Pro, ROM improves the overall tradeoff on both Qwen3-8B and DeepSeek-R1-Distill-Qwen-32B (DS-32B): on Qwen3-8B, it raises accuracy from 74.47% to 74.78% and reduces response length from 4262 to 3107 tokens; on DS-32B, it raises accuracy from 68.60% to 68.72% and reduces response length from 3062 to 2319 tokens. The same FCS-derived supervision transfers across scale and training origin, suggesting a shared long-CoT boundary rather than a backbone-specific artifact. ROM is compatible with L1, removing another 20.9-21.6% tokens at zero accuracy loss. ROM also generalizes to open-ended MMLU-Pro (+1.56 pp, 35.4% shorter) and reduces wall-clock latency by 46.5%. Code is available at https://github.com/SaFo-Lab/ROM.
comment: Code is available at https://github.com/SaFo-Lab/ROM
♻ ☆ Bring Your Own Prompts: Use-Case-Specific Bias and Fairness Evaluation for LLMs
Bias and fairness risks in Large Language Models (LLMs) vary substantially across deployment contexts, yet existing approaches lack systematic guidance for selecting appropriate evaluation metrics. We present a decision framework that maps LLM use cases, characterized by a model and population of prompts, to relevant bias and fairness metrics based on task type, whether prompts contain protected attribute mentions, and stakeholder priorities. Our framework addresses toxicity, stereotyping, counterfactual unfairness, and allocational harms, and introduces novel metrics based on stereotype classifiers and counterfactual adaptations of text similarity measures. We release an open-source Python library, \texttt{langfair}, for practical adoption. Extensive experiments on use cases across five LLMs and five prompt populations demonstrate that fairness risks cannot be reliably assessed from benchmark performance alone: results on one prompt dataset likely overstate or understate risks for another, underscoring that fairness evaluation must be grounded in the specific deployment context.
comment: v6: Updated title; LangFair repository: https://github.com/cvs-health/langfair
♻ ☆ Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
When the substantive content of a request is rewritten, do large language models still answer in the format the original task asked for? We find that they often do not, even at temperature zero. On a 150-query evaluation over five compact 2025-era LLMs and four task types, we observe a systematic failure mode we call prompt-variant output-mode collapse: when a closed-form prompt asks for a bare label or a single choice token, content-preserving prompt variants can push the model into conversational prose, the requested format dissolves, and exact-match evaluation pipelines silently misjudge the result. To make this measurable, we release PARACONSIST, a 900-prompt benchmark of 150 base queries with five lexical, syntactic, and semantic-expansion prompt variants each, and a Semantic Consistency Score that decomposes prompt-variant robustness into answer consistency, sentence-BERT semantic similarity, and length stability. Under a whole-word answer-set match, only ~22% of closed-form variant responses preserve the ground-truth label inside their output, while ~78% drift away from the answer space entirely. In our pool, the dominant predictor of collapse is task structure rather than model identity, with model differentiation jointly carried by answer consistency and length stability. Robustness audits should therefore track response-mode preservation as a first-class reliability target alongside answer accuracy.
comment: Added a footnote; author order is alphabetical by last name
♻ ☆ Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner ICML 2025
Theory-of-Mind (ToM) enables humans to infer mental states-such as beliefs, desires, and intentions-forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.
comment: Accepted as a Spotlight at the 2025 Forty-Second International Conference on Machine Learning (ICML 2025)
♻ ☆ Users as Annotators: LLM Preference Learning from Comparison Mode
Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collect pairwise preference data -- user annotation from comparison mode. With the increasingly wider adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. In this paper, we consider a new idea of generating two responses from two different models or two different versions of the same model. The asymmetry allows us to make an inference of the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing the user behavior and data filtering for LLM alignment.
♻ ☆ LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories
Large language models (LLMs) are commonly evaluated on challenging benchmarks such as AIME and Math500, where benchmark contamination can make memorized solutions appear as genuine reasoning. Existing detection methods largely rely on surface overlap, completion behavior, or final-output likelihood, and often degrade when inputs are simply rephrased. In this paper, we propose LogitTrace(Layerwise Logit Trajectories), a framework for analyzing memorization-like decision dynamics through intermediate logit trajectories. Instead of judging memorization only from the final answer, LogitTrace examines how model preferences emerge and stabilize across layers. We find that contaminated examples tend to show earlier commitment, while clean examples exhibit more gradual evidence accumulation. These trajectory signals allow a lightweight classifier to separate contaminated and clean examples across multiple models and input variants. Controlled LoRA injection experiments further show that repeated exposure to target samples induces similar trajectory patterns. Overall, our results suggest that LogitTrace provides evidence beyond surface overlap and final-output confidence, offering a useful lens for studying memorization-like behavior in LLMs.
comment: 23pages, 10 figures, 9tables
♻ ☆ GRIT: Teaching MLLMs to Think with Images
Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
♻ ☆ Selective Deficits in LLM Mental Self-Modeling in a Behavior-Based Test of Theory of Mind
The ability to represent oneself and others as agents with knowledge, intentions, and belief states that guide their behavior - Theory of Mind - is a human universal that enables us to navigate - and manipulate - the social world. It is supported by our ability to form mental models of ourselves and others. Its ubiquity in human affairs entails that LLMs have seen innumerable examples of it in their training data and therefore may have learned to mimic it, but whether they have actually learned causal models that they can deploy in arbitrary settings is unclear. We therefore develop a novel experimental paradigm that requires that subjects form representations of the mental states of themselves and others and act on them strategically rather than merely describe them. We test a wide range of leading open and closed source LLMs released since 2024, as well as human subjects, on this paradigm. We find that 1) LLMs released before mid-2025 fail at all of our tasks, 2) more recent LLMs achieve human-level performance on modeling the cognitive states of others, and 3) even frontier LLMs fail at our self-modeling task - unless afforded a scratchpad in the form of a reasoning trace. We further demonstrate cognitive load effects on other-modeling tasks, offering suggestive evidence that LLMs are using something akin to limited-capacity working memory to hold these mental representations in mind during a single forward pass. Finally, we explore the mechanisms by which reasoning models succeed at the self- and other-modeling tasks, and show that they readily engage in strategic deception.
comment: 22 pages, 13 figures, 1 table
♻ ☆ Text Corpora as Concept Fields: Black-Box Hallucination and Novelty Measurement
We introduce the \textbf{Concept Field} of a text corpus: a local drift field with pointwise uncertainty, estimated in sentence-embedding space from the deltas between consecutive sentences. Given a candidate sentence transition, we score its agreement with the field by $ζ$, the mean absolute z-distance between the observed delta and the field's local Gaussian estimate. The score is black-box (no model internals), corpus-attributable (every score traces to nearby corpus sentences), and admits a probabilistically motivated interpretation under a local Gaussian approximation. We support the computation with the introduction of a \textbf{Vector Sequence Database (VSDB)} that stores embeddings together with sequence-position and next-delta metadata. We evaluate this approach on two large-scale settings: hallucination-style groundedness detection over the U.S. Code of Federal Regulations, and novelty detection over Project Gutenberg. On controlled LLM-generated rewrites, Concept Fields achieve strong selective classification performance under a grounded / ungrounded / unsure triage policy. Unlike retrieval-centric baselines, the resulting coverage-risk behavior is similar across both domains, supporting a degree of cross-domain stability for the standardized deviation score. We also sketch how divergence and curl of the Concept Field, computed on dense clusters, surface qualitatively meaningful semantic patterns (logic sources, sinks, and implicit topics), which we offer as hypothesis-generating rather than as a quantitative result. Concept Fields provide a fast, lightweight, and interpretable signal for groundedness and novelty, complementary to LLM-as-judge and white-box detectors.
comment: 25 pages, 8 figures
♻ ☆ STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Storie
Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
comment: 66 pages, 9 figures
Multimedia 8
☆ Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination
Virtual production (VP) use LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimates 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space - implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.
☆ Accelerating Multi-Condition T2I Generation via Adaptive Condition Offloading and Pruning IEEE
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizations. It hence leads to unacceptable generation latency. In this paper, we propose an end-edge collaborative system design to accelerate multi-condition T2I generation through adaptive condition offloading and pruning. Extensive offline profiling reveal that, different conditions exhibit significant diversity in computation and communication costs. To this end, we propose a \textit{Subtask Manager} that jointly optimizes condition inference offloading and bandwidth allocation using a heuristic algorithm, balancing local and edge execution delays to minimize overall preprocessing latency. Then, we design a lightweight feature-driven \textit{Conditioning Scale Estimator} that evaluates the contribution of each condition by analyzing its feature activation strength and overlap with other conditions. This allows adaptive conditioning scale selection and pruning of insignificant conditions, thereby accelerating the denoising process. Extensive experimental results show that our system reduces latency by nearly 25\% and improves 6\% average generation quality, outperforming other benchmarks.
comment: accepted by IEEE ICME 2026
☆ Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
☆ EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.
☆ Thin-Client Interactive Gaussian Adaptive Streaming over HTTP/3
Recent advancements in 3D Gaussian Splatting (3DGS) have enabled photorealistic rendering of complex scenes, yet widespread adoption on mobile and Extended Reality (XR) devices is hindered by substantial computational and bandwidth requirements. While existing solutions often focus on model compression for client-side rendering, they still demand significant GPU power, limiting applicability on resource-constrained hardware. We propose TIGAS (Thin-client Interactive Gaussian Adaptive Streaming), a remote rendering framework offloading rasterization to a backend. To bypass the prohibitive latencies connected to fluctuating network conditions, TIGAS streams view-dependent 2D projections to a lightweight web client over QUIC, minimizing head-of-line (HoL) blocking. A dedicated ABR algorithm adapts rendering quality to fluctuating network conditions, maintaining motion-to-photon latency within strict 6DoF interactive constraints. Furthermore, we discuss the integration of an experimental WebGPU super-resolution pipeline to analyze the trade-offs between perceptual quality enhancements and thin-client processing bottlenecks. We extensively evaluate TIGAS across multi-continental environments using 14 3DGS models and real 6DoF EyeNavGS movement traces. Powered by a backend rendering frames in under 10 milliseconds, TIGAS maintains latency within interactive thresholds while achieving an average SSIM of 0.88, serving both as a robust testbed for 3DGS streaming research and a capable delivery system. The source code is available at: https://github.com/Rekenar/GaussianAdaptiveStreamer.
☆ Streaming of rendered content with adaptive frame rate and resolution
Streaming rendered content is an attractive way to bring high-quality graphics to billions of mobile devices that do not have sufficient rendering power. Existing solutions render content on a server at a fixed frame rate, typically 30 or 60 frames per second, and reduce resolution when bandwidth is restricted. However, this strategy leads to suboptimal rendering quality under the bandwidth constraints. In this work, we exploit the spatio-temporal limits of the human visual system to improve perceived quality while reducing rendering costs by adaptively adjusting both frame rate and resolution based on scene content and motion. Our approach is codec-agnostic and requires only minimal modifications to existing rendering infrastructure. We propose a system in which a lightweight neural network predicts the optimal combination of frame rate and resolution for a given transmission bandwidth, content, and motion velocity. This prediction significantly enhances perceptual quality while minimizing computational cost under bandwidth constraints. The network is trained on a large dataset of rendered content labeled with a perceptual video quality metric. The dataset and further information can be found at the project web page: https://www.cl.cam.ac.uk/research/rainbow/projects/adaptive_streaming/.
♻ ☆ From Natural Alignment to Conditional Controllability in Multimodal Dialogue ICLR 2026
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations in interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks to replicate the nuanced expressiveness of human interaction. These findings provides new insights and challenges for multimodal conditional dialogue generation.
comment: Accepted by ICLR 2026
♻ ☆ Context and Pixel Aware Large Language Model for Video Quality Assessment ICIP 2026
Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) struggle with sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experiment results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.
comment: Accepted to ICIP 2026
Computer Vision and Pattern Recognition 234
☆ 123D: Unifying Multi-Modal Autonomous Driving Data at Scale
The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
☆ Normalizing Trajectory Models
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
☆ EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., $O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
☆ Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment CVPR 2026
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world.Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
comment: Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D
☆ Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
☆ 6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks
In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.
comment: Source code available at: https://github.com/ameermasood/HeatNet
☆ Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization CVPR2026
Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.
comment: Accepted to CVPR2026
☆ MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
☆ SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
☆ Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods.Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.
☆ PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction
Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.
☆ STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
comment: 19 pages, 9 figures
☆ TRAS: An Interactive Software for Tracing Tree Ring Cross Sections
Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), an open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder ($r > 0.99$). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at the https://hmarichal93.github.io/tras.
comment: This manuscript has been accepted for publication in Forestry: An International Journal of Forest Research, published by Oxford University Press. This is an author-produced version and may differ from the final Version of Record. The final published version will be available through the journal website
☆ SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Frechet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
comment: 48 pages, 25 figures
☆ Rethinking Dense Optical Flow without Test-Time Scaling CVPR 2026
Recent progress in dense optical flow has been driven by increasingly complex architectures and multi-step refinement for test-time scaling. While these approaches achieve strong benchmark performance, they also require substantial computation during inference. This raises a fundamental question: Is scaling test-time computation the only way to improve dense optical flow accuracy? We argue that it is not. Instead, powerful visual semantic and geometric priors encoded in modern foundation models can reduce, if not overcome, the need for computationally expensive iterative refinement at test-time. In this paper, we present a framework that estimates dense optical flow in a single forward pass, leveraging pretrained foundation representations, while avoiding iterative refinement and additional inference-time computation, thus offering an alternative to test-time scaling. Our method extracts visual semantic features from a frozen DINO-v2 backbone and combines them with geometric cues from a monocular depth foundation model. We fuse these complementary priors into a unified representation and apply a global matching formulation to estimate dense correspondences without recurrent updates or test-time optimization. Despite avoiding iterative refinement, our approach achieves strong cross-dataset generalization across challenging benchmarks. On Sintel Final, we obtain 2.81 EPE without refinement, significantly improving over state-of-the-art (SOTA) SEA-RAFT under comparable training conditions and outperforming RAFT, GMFlow (without refinement), and recent FlowSeek in the same setting. These results suggest that strong foundation priors can substitute for test-time scaling, offering a computationally efficient alternative to refinement-heavy pipelines.
comment: Accepted for publication at CVPR 2026; ViSCALE Workshop. Draft info: 10 pages, 2 figures, 4 tables
☆ Uncertainty Quantification for Cardiac Shape Reconstruction with Deep Signed Distance Functions via MCMC methods
Atlas-based approaches allow high-quality, patient-specific shape reconstructions of cardiac anatomy from sparse and/or noisy data such as point clouds. However, these methods are mainly prior-driven, so the impact of uncertainty can be large, limiting their clinical reliability. We propose a probabilistic framework for uncertainty-aware cardiac shape reconstruction that combines Deep Signed Distance Functions (DeepSDFs) with Markov Chain Monte Carlo (MCMC) sampling. Cardiac geometries are modeled implicitly as zero-level sets of a neural network conditioned on learned latent codes, enabling multi-surface reconstruction of the left and right ventricles. By interpreting the reconstruction loss as a log-likelihood, we perform Bayesian inference in the latent space to obtain both maximum a posteriori (MAP) and posterior-sampled reconstructions. Experiments on a public cardiac dataset show that our approach produces accurate reconstructions and well-calibrated uncertainty estimates.
☆ Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
Cross-view localization classically asks: where does this ground image lie on the satellite tile? Existing methods are typically limited to 3-DoF estimates -- an $(x,y)$ position and a yaw angle -- because nadir satellite imagery provides no direct cues for roll, pitch, or altitude, forcing a reliance on planar-motion and zero-tilt assumptions. These assumptions break on real terrain with slopes, ramps, and tilted camera mounts. To overcome this, we introduce a single UAV image as an intermediate viewpoint: it reveals the 3D structure invisible from nadir, supplies the cues for roll, pitch, and altitude that the satellite alone cannot provide, and needs only spatial overlap with the ground camera -- no known relative pose is required. Building on this insight, we propose **Cross3R**, a flexible feed-forward model that ingests a satellite tile together with a UAV image, a ground image, or both, and, in a single forward pass, recovers a cross-view 3D point cloud, the 6-DoF poses of every input camera, and the on-tile $(x,y)$ position and yaw of each perspective camera. For training and evaluation, we also construct **CrossGeo**, a 278K-image tri-view dataset spanning 85 scenes across every continent except Antarctica. On CrossGeo, Cross3R consistently outperforms feed-forward 3D baselines in point-cloud reconstruction, 6-DoF camera-pose estimation, and cross-view localization. On KITTI, it outperforms dedicated cross-view methods trained on KITTI on most metrics, despite having no KITTI training itself.
☆ HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
Text-to-image diffusion models can generate visually stunning images, yet, controlling what appears and how it appears, remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast, and controllable image generation.
☆ DVD: Discrete Voxel Diffusion for 3D Generation and Editing
We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations.
☆ TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model
Multiple sclerosis (MS) expresses substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning-based SOTA is highly susceptible to changes in both distribution, e.g., changes in scanner; as well as the structure of inputs, evident in the current divide between cross-sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast-agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross-sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model-based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in-house datasets show that TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both the former and LST-AI. All source code related to the development of TimeLesSeg is available at https://github.com/NeuroADaS-Lab/TimeLesSeg.
☆ Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions
We present CoopNet, an approach that improves the cooperation of co-trained networks by dynamically adapting the apportionment of gradient, to ensure equitable learning progress. It is applied to motion-aware self-supervised prediction of depth maps, by introducing a new hybrid loss, based on a distribution model of photo-metric reconstruction errors made by, on the one hand the depth + odometry paired networks, and on the other hand the optical flow network. This model essentially assumes that the pixels from moving objects (that must be discarded for training depth and odometry), correspond to those where the two reconstructions strongly disagree. We justify this model by theoretical considerations and experimental evidences. A comparative evaluation on KITTI and CityScapes datasets shows that CoopNet improves or is comparable to the state-of-the-art in depth, odometry and optical flow predictions.
☆ TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
☆ Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
Exemplar-based image editing applies a transformation defined by a source-target image pair to a new query image. Existing methods rely on a pair-of-pairs supervision paradigm, requiring two image pairs sharing the same edit semantics to learn the target transformation. This constraint makes training data difficult to curate at scale and limits generalization across diverse edit types. We propose Delta-Adapter, a method that learns transferable editing semantics under single-pair supervision, requiring no textual guidance. Rather than directly exposing the exemplar pair to the model, we leverage a pre-trained vision encoder to extract a semantic delta that encodes the visual transformation between the two images. This semantic delta is injected into a pre-trained image editing model via a Perceiver-based adapter. Since the target image is never directly visible to the model, it can serve as the prediction target, enabling single-pair supervision without requiring additional exemplar pairs. This formulation allows us to leverage existing large-scale editing datasets for training. To further promote faithful transformation transfer, we introduce a semantic delta consistency loss that aligns the semantic change of the generated output with the ground-truth semantic delta extracted from the exemplar pair. Extensive experiments demonstrate that Delta-Adapter consistently improves both editing accuracy and content consistency over four strong baselines on seen editing tasks, while also generalizing more effectively to unseen editing tasks. Code will be available at https://delta-adapter.github.io.
☆ One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $π_0$).
☆ MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Medical vision--language models (VLMs) are usually evaluated on intact image--question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce medvigil, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2{,}556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the medvigil Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
☆ What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained representations, leaving unclear what kind of latent space is truly friendly for generative modeling. In this paper, we study this question from the perspective of latent manifold organization. By constructing controlled tokenizer variants, we identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. We find that these properties are more consistent with downstream generation quality than reconstruction fidelity. Motivated by this finding, we propose the Prior-Aligned AutoEncoder (PAE), which explicitly shapes the latent manifold instead of leaving diffusion-friendly manifold to emerge indirectly from reconstruction or inheritance. Specifically, PAE leverages refined priors derived from VFMs and perturbation-based regularization to turn spatial structure, local continuity, and global semantics into explicit training objectives. On ImageNet 256x256, PAE improves both training efficiency and generation quality over existing tokenizers, reaching performance comparable to RAE with up to 13x faster convergence under the same training setup and achieving a new state-of-the-art gFID of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models.
☆ Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning NeurIPS 2026
Sharpness-aware and gradient-alignment methods have been shown to improve generalization, however each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}Σ_g$ and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $Σ_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration) that targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
comment: Preprint - Submitted to NeurIPS 2026
☆ One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction
Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agent such as cars and pedestrians at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, which is an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose Dust (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decouple pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling enables the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make Dust practical, we further introduce a static anchor-based pose correction pipeline that corrects spatio misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, with keeping robustness under larger temporal asynchrony. Code is available at https://anonymous.4open.science/r/DUST-6A55.
☆ Consistency Regularised Gradient Flows for Inverse Problems
Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
☆ Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage~1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage~2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48\% at 128 frames over the backbone.
☆ Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability Analysis
Federated Learning (FL) enables decentralised model training across distributed clients without requiring data centralisation. However, the generalisation performance of the global model is usually degraded by data heterogeneity across clients, particularly under limited data availability and class imbalance. To address this challenge, we propose FedQuad, a novel method that explicitly enforces minimising intra-class representations while enabling inter-class splits across clients. By jointly minimising distances between positive pairs and maximising distances between negative pairs, the proposed approach mitigates representation misalignment introduced during model aggregation. We evaluate our method on CIFAR-10, CIFAR-100, and Tiny-ImageNet under diverse non-IID settings and varying numbers of clients, demonstrating consistent improvements over existing baselines. Additionally, we provide a comprehensive analysis of metric learning-based approaches in both centralised and federated environments, highlighting their effectiveness in alleviating representation collapse under heterogeneous data distributions.
comment: arXiv admin note: substantial text overlap with arXiv:2509.04107
☆ Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
☆ From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
Makeup transfer aims to apply the makeup style of a reference portrait to a source portrait while preserving identity and background. Early methods formulate this task as unsupervised image-to-image translation, relying on surrogate objectives and often yielding limited performance. Recent diffusion- and flow-based approaches instead exploit synthetic data for supervised training, leading to significant improvements. However, these methods still face two critical challenges: synthetic supervision frequently fails to faithfully preserve identity, and the domain gap between synthetic and real data limits generalization, resulting in degraded performance in complex real-world scenarios. To address these issues, this paper first proposes ConsistentBeauty, a novel data curation pipeline that ensures makeup fidelity and strict identity consistency within the synthesized data. Second, we propose RealBeauty, a synthetic-to-real post-training framework. Beyond supervised learning on curated synthetic data, we further adapt the model to real-world scenarios through reinforcement learning and design novel verifiable rewards tailored to the makeup transfer task. It allows the model to further benefit from real makeup patterns beyond synthetic supervision. In addition, we establish a new diverse benchmark for makeup transfer, covering a wide range of skin tones, ages, genders, poses, and makeup styles, thereby enabling a more comprehensive evaluation of model performance under diverse real-world conditions. Extensive experiments show that our method achieves state-of-the-art performance on multiple benchmarks and demonstrates clear advantages in identity preservation and performance on complex real-world cases.
☆ EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding IJCAI 2026
Driver cognitive distraction is a major cause of road collisions and remains difficult to detect. Unlike manual or visual distraction, cognitive distraction is diverted by thoughts unrelated to driving, even when the driver appears visually attentive and exhibits no explicit physical movements. In this work, we propose EyeCue, a gaze-empowered egocentric video understanding framework, to detect driver cognitive distraction. A key insight is that cognitive distraction manifests in the interaction between eye gaze and visual context. To capture this interaction, EyeCue integrates eye gaze with egocentric video to enable context-aware modeling of the driver's attention over time. Furthermore, to tackle the limited scale and diversity of existing datasets, we introduce CogDrive, a comprehensive multi-scenario dataset that augments four existing driving datasets with cognitive distraction annotations. Through extensive evaluations on CogDrive, we show that EyeCue achieves the highest accuracy of 74.38%, outperforming 11 baselines from 6 model families by over 7%. Notably, EyeCue can achieve an accuracy of over 70% across various driving scenarios (different road types, times of day, and weather conditions) with strong generalizability. These results highlight the importance of modeling gaze-context interactions and the effectiveness of cross-modal interaction modeling for multimodal cognitive distraction detection. Our codes and CogDrive dataset resources are available at https://github.com/langzhang2000/EyeCue.
comment: Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
☆ BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.
comment: 11 pages, 6 figures
☆ Explainable Part-Based Vehicle Classifier with Spatial Awareness
In the area of Intelligent Transportation Systems (ITS), fine-grained vehicle classification systems play an essential role. Recently, the authors have presented a novel vision-based classification approach in which standard end-to-end Convolutional Neural Networks (CNNs) have been decomposed into 1) a CNN-based detector for semantically strong vehicle parts, followed by 2) feature construction and 3) final classification by a decision tree. In contrast to conventional CNNs, this allows both easy extensibility to new vehicle categories - without the need to fully retrain the part detector - and an important step towards the interpretability of the model, removing partially the black-box nature inherent to CNNs. Here we present an important extension of this approach that now incorporates spatial awareness of the vehicle parts: while the feature construction 2) of the previous approach used a binary decision for each feature (present vs. absent), now a full spatial probability map is constructed to condition the presence of each individual part with respect to a given vehicle category. The classification is performed using a softmax regression approach for the overall vehicle probabilities. This method shows a considerably improved robustness against false (part-)detections, a point that is crucial for practical application. Comparative analyses with a state-of-the-art end-to-end CNN indicate that our part-based methods achieve comparable accuracy, effectively challenging the presumed trade-off between accuracy and explainability. This research represents a significant advance in vehicle classification for ITS and forms the basis for systems that combine high accuracy with intuitive interpretability.
☆ Anisotropic Modality Align
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
☆ Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection CVPR2026
Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.
comment: This paper has been accepted by CVPR2026
☆ GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
☆ ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles
This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 50 known and 16 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
Pre-training Enables Extraordinary All-optical Image Denoising
Optical neural networks are emerging as powerful machine learning and information processing tools because of their potential advantages in speed and energy efficiency. The training methods of these physical models, however, remain underexplored compared to their digital counterparts and are leading to suboptimal performance. This paper reports a pre-training-driven approach that leads to snapshot image denoising with substantially improved quality. We demonstrated effective free-space optical denoising by a diffractive network optimized by a two-step process including (1) pre-training using a massive dataset of 3.45 million diverse but simple images and (2) fine-tuning with the corresponding task-specific datasets. Compared to conventional Fourier-domain filtering and directly trained diffractive networks, such a transfer learning process exhibited prominent advantages for denoising images degraded by severe noise, peak signal-to-noise ratio (PSNR) below 8 dB, while preserving fine image features and improving the PSNR to above 18 dB. Importantly, the same pre-trained optical network could be consistently fine-tuned to process degraded images from highly diverse styles ranging from handwritten digits (MNIST) and chest X-rays (ChestMNIST) to CIFAR-10 images and human faces (CelebA). We further demonstrated the critical role of our optical denoisers in vision-based applications, including face detection, plate recognition, and localization of UAVs in noisy conditions.
☆ Text-to-CAD Evaluation with CADTests
Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
☆ SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
Recent video diffusion models (VDMs) synthesize visually convincing clips, yet still drop entities, mis-bind attributes, and weaken the interactions specified in the prompt. Representation-alignment objectives such as VideoREPA and MoAlign improve fine-grained text following by distilling spatio-temporal token relations from a frozen visual foundation model, but their pairwise supervision budget is allocated by visual or motion cues rather than by how relevant each pair is to the prompt. We present SARA, Semantically Adaptive Relational Alignment, which keeps token-relation distillation (TRD) on a frozen VFM target and adds a text-conditioned saliency that decides which token pairs carry supervision. A lightweight Stage 1 aligner is trained with per-entity SAM 3.1 mask supervision and an InfoNCE regulariser, and its continuous saliency is fused into TRD through a pair-routing operator that assigns each token pair a weight whenever either of its two endpoints is salient, thereby routing supervision toward subject-subject and subject-background pairs and away from background-background ones. In the Wan2.2 continual-training setting, SARA improves both text alignment and motion quality over SFT, VideoREPA, and MoAlign on a 13-dimension VLM rubric, on the public VBench benchmarks, and in a blind user study.
☆ Spectral Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation
The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Spectral Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.
☆ APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidences. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
☆ Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation
Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
☆ Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
Explicit neural representations such as 3D Gaussian Splatting (3DGS) enable high-fidelity and real-time novel view synthesis, yet optimize for alpha-composited optical appearance rather than ray-intersectable geometry. In contrast, radio-frequency (RF) digital twins require deterministic multi-bounce paths, where the geometry dictates trajectories and their associated attenuation and delay. We introduce a framework enabling differentiable RF propagation simulation directly within visually reconstructed neural scenes, allowing point-to-point path computation between arbitrary 3D locations while preserving high-quality visual rendering. Unlike conventional RF simulation pipelines that rely on manually constructed meshes, we embed Gaussian primitives into a hardware-accelerated ray tracing structure as the underlying spatial representation. By extracting physically meaningful channel impulse responses from visual-only reconstructions, we provide cross-modal evidence that neural reconstructions can serve as unified spatial representations for both electromagnetic propagation simulation and photorealistic view synthesis.
☆ SIMI: Self-information Mining Network for Low-light Image Enhancement
Poor lighting conditions significantly impact image quality, posing substantial challenges for image editing and visualization. Many existing enhancement methods aim at proposing complex models while neglecting the intrinsic information contained within low-light images. In this work, we propose the Self-Information Mining (SIMI) network, an innovative unsupervised framework that decomposes low-light images into multiple components based on bit-plane decomposition. Our approach allows mining intrinsic information without relying on external data. This not only accelerates model convergence but also improves performance and reduces computational overhead. The unsupervised nature of our method facilitates real-world applicability. Experiments conducted on standard benchmarks demonstrate that SIMI achieves state-of-the-art performance.
☆ Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition
Many vision applications require identity consistency beyond strict biometric recognition, especially under non-frontal views or when facial cues are missing. However, conventional face recognition models enforce intra-identity invariance, collapsing appearance variations such as hairstyle or styling changes into a single representation, limiting their use in appearance-sensitive scenarios. To address this limitation, we introduce Head Similarity, a new formulation that extends identity-centric recognition to structured whole-head similarity modeling. Our approach explicitly captures intra-identity appearance variation and enforces hierarchical similarity ordering across identity and appearance states, enabling meaningful comparison even under occlusion or rear-view conditions. We construct a large-scale benchmark from long-form videos with weakly-supervised appearance states, covering diverse poses, occlusions, and temporal changes. As a first step, we develop a simple yet effective framework that jointly models identity discrimination and appearance-sensitive similarity through hierarchical supervision and identity-aware distillation. Experiments show that conventional face recognition models fail to capture appearance-dependent similarity, while our approach demonstrates the feasibility of structured whole-head similarity modeling.
☆ Benchmarking Foundation Models for Renal Lesion Stratification in CT
The rapid proliferation of open-source medical foundation models (FMs) raises a practical question: how well do their pre-trained representations transfer to clinically relevant but data-scarce classification tasks? Particularly in CT-based renal lesion classification, a push toward greater generalizability would be meaningful, as the field is constrained by inherently limited training data. We addressed this through a benchmark of three medical FMs on this specific task. This six-class problem spans common entities like cysts and clear cell renal cell carcinoma, alongside rare subtypes. Using a frozen feature-probing protocol, we compared FM embeddings against a handcrafted radiomics classifier and a 3D ResNet-50 trained from scratch. Models were trained on a composite dataset of 2,854 lesions and evaluated on an external test set of 234 lesions from The Cancer Imaging Archive. Our results reveal two key findings. First, FM performance (AUC 0.70-0.77) matched the from-scratch ResNet (AUC 0.72) while drastically reducing hardware demand, requiring only seconds on a CPU after feature extraction. However, the conventional radiomics baseline significantly outperformed all deep learning approaches, achieving an AUC of 0.88 (all p $\leq$ 0.002). This suggests that current generalist FM embeddings do not yet capture the fine-grained texture and shape heterogeneity driving histological subtype discrimination. Despite their potential in data-scarce settings, medical FMs did not surpass established models for renal lesion stratification, leaving radiomics as the current state-of-the-art.
comment: 13 pages, 4 figures
☆ LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset
Mining operations are of utmost importance to the economy of some nations. However, such operations result in land-use change, very high energy consumption, and negative impacts on the environment, including soil erosion and deforestation. The mining process can impact an area much larger than the mining site itself. Adding to the negative externalities linked to mining is the fact that, in addition to government-sanctioned legal mining operations, illegal mining is widespread, including in various countries of Africa. The ability to monitor remote mining site activities can be useful, e.g., for the detection of illegal artisanal mining activities and their environmental impacts. An important outcome of such monitoring could include a better understanding of the interrelationship between mine facility attributes (e.g., mining types, processing methods, commodities, etc.) and their impact on the natural environment. In this work, we present a data set that contains 150 Large Scale Mining (LSM) sites and 870km^2 annotated area of Artisanal Small-scale Mining (ASM) sites. The metadata includes nine eminent LSM sections and 27 mining site attributes for each LSM site. We also discuss the data set's possible contribution to the research community, social and environmental consequences, and researchers' responsibilities from an ethics perspective.
☆ OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
High-fidelity surgical video generation can greatly improve medical training and the development of AI, adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument tissue interactions or procedural phases is challenging due to the strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrates that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit
☆ Stochastic Transition-Map Distillation for Fast Probabilistic Inference
Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.
☆ Towards Billion-scale Multi-modal Biometric Search
Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India's Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
☆ Aquatic Neuromorphic Optical Flow
Underwater environments impose severe constraints on conventional imaging systems and demand solutions that balance high-quality sensing with strict resource efficiency. While emerging event cameras offer a promising alternative, their potential in aquatic scenarios remains largely unexplored. Through the lens of neuromorphic vision, this work pioneers the investigation of motion fields that serve as key media for agile underwater perception. Built upon spiking neural networks, we introduce a self-supervised framework to estimate per-pixel optical flow from asynchronous event streams, elegantly bypassing the long-standing bottleneck of underwater data scarcity. Extensive evaluations demonstrate that our method achieves competitive visual and quantitative results against leading techniques while operating with superior computational efficiency. By bridging neuromorphic sensing and aquatic intelligence, this work opens new frontiers for lightweight, real-time, and low-cost perception on resource-constrained underwater edge platforms.
comment: This work is under review
☆ Breaking Spatial Uniformity: Prior-Guided Mamba with Radial Serialization for Lens Flare Removal
Lens flares, caused by complex optical aberrations, severely degrade image quality especially in nighttime photography. Although recent restoration methods have made remarkable progress, most still rely on spatially uniform processing. They are failing to handle the region-dependent restoration demands of flare scenes, where saturated light sources should be preserved, flare artifacts removed, and background details recovered. To address this challenge, we propose DeflareMambav2, a prior-guided Mamba framework for lens flare removal. Specifically, we introduce a Flare Prior Network (FPN) to estimate flare priors and guide adaptive restoration. Besides, a novel radial serialization strategy breaks spatially homogeneous processing by performing flare-aware targeted sampling, and better supports long-range modeling in State Space Models (SSMs). Based on these priors, the backbone adopts a dual-level adaptive scheme. It explicitly preserves light-source regions to avoid over-processing, and applies curriculum-based restoration to the remaining contaminated areas while calibrating restoration intensity at the pixel level. Extensive experiments demonstrate that DeflareMambav2 achieves state-of-the-art performance with reduced parameter burden. Code is available at https://github.com/BNU-ERC-ITEA/DeflareMambav2.
☆ Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
comment: 8 pages, 4 figures
☆ EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting CVPR
Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains a highly challenging problem because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent-without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand
comment: CVPR Findings 2026
☆ LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multi sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models eveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
☆ FS-I2P:A Hierarchical Focus-Sweep Registration Network with Dynamically Allocated Depth ICML 2026
Image-to-point cloud registration is often challenged by viewpoint changes, cross-modal discrepancies, and repetitive textures, which induce scale ambiguity and consequently lead to erroneous correspondences. Recent detection-free methods alleviate this issue by leveraging multi-scale features and transformer-based interactions. However, they still suffer from attention drift across layers and intra-scale inconsistencies, hindering precise registration. Inspired by human behavior, we propose a ``Focus--Sweep'' paradigm and develop a Hierarchical Focus--Sweep Interaction Module within an SSM-based framework to enhance multi-level cross-modal feature association. In addition, we introduce a Dynamic Layer Allocation Strategy that adaptively determines the iteration depth to better exploit geometric constraints and improve matching robustness. Extensive experiments and ablations on two benchmarks, RGB-D Scenes V2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance.
comment: ICML 2026
☆ SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results over both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.
☆ TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
Real-world audio-visual understanding requires chaining evidence that is sparse, temporally dispersed, and split across the visual and auditory streams, whereas existing benchmarks largely fail to evaluate this capability. They restrict videos to short clips, isolate modalities, or reduce questions to one-hop perception. We introduce TraceAV-Bench, the first benchmark to jointly evaluate multi-hop reasoning over long audio-visual trajectories and multimodal hallucination robustness. TraceAV-Bench comprises 2,200 rigorously validated multiple-choice questions over 578 long videos, totaling 339.5 hours, spanning 4 evaluation dimensions and 15 sub-tasks. Each question is grounded in an explicit reasoning chain that averages 3.68 hops across a 15.1-minute temporal span. The dataset is built by a three-step semi-automated pipeline followed by a strict quality assurance process. Evaluation of multiple representative OmniLLMs on TraceAV-Bench reveals that the benchmark poses a persistent challenge across all models, with the strongest closed-source model (Gemini 3.1 Pro) reaching only 68.29% on general tasks, and the best open-source model (Ming-Flash-Omni-2.0) reaching 51.70%, leaving substantial headroom. Moreover, we find that robustness to multimodal hallucination is largely decoupled from general multimodal reasoning performance. We anticipate that TraceAV-Bench will stimulate further research toward OmniLLMs that can reason coherently and faithfully over long-form audio-visual content.
☆ Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness
Despite extensive progress in point cloud robustness, existing methods primarily improve performance through augmentation or defense mechanisms, while overlooking the geometric root cause of adversarial fragility. We hypothesize that adversarial vulnerability in 3D networks arises from a manifold misalignment between the latent geometry learned by the model and the intrinsic geometry of the underlying surface. Small, geometry-preserving perturbations along the input manifold often induce disproportionate distortions in feature space, revealing a misalignment between latent and intrinsic geometries. We formalize this phenomenon by developing a geometric interpretation of 3D robustness that links classical adversarial theory to the intrinsic structure of point clouds. Motivated by this analysis, we introduce Manifold-Aligned Point Recognition (MAPR), a framework that regularizes the latent geometry by aligning predictions across intrinsic perturbations. MAPR augments each point cloud with intrinsic features capturing local curvature and diffusion structure, and applies a consistency loss that preserves invariance to intrinsic, geometry-preserving perturbations. Without relying on adversarial training or additional data, MAPR consistently improves robustness across multiple adversarial attacks on both the ModelNet40 and ScanObjectNN datasets, achieving average robustness gains of +20.02% and +8.58% on ModelNet40 and ScanObjectNN, respectively.
☆ Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding ACL 2026
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions.By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
comment: Accepted to ACL 2026
☆ PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
comment: 23 pages, 12 figures, including appendices
☆ Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with temporal-aware video-centric encoder, time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1\% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
☆ Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
comment: Under review. 30 pages, 16 figures, 7 tables
☆ Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer
Pathological complete response (pCR) is a key prognostic factor in breast cancer patients undergoing neoadjuvant therapy, strongly associated with long-term survival and treatment personalization. However, accurate pre-treatment pCR prediction remains challenging due to severe class imbalance and limited generalizability across diverse clinical settings. In this work, we propose a multimodal stepwise clinically-guided attention learning framework for pCR prediction from breast magnetic resonance imaging (MRI), designed to address these limitations through medically grounded spatial guidance and multimodal integration. The approach follows a stepwise training strategy inspired by physician reasoning: the model first learns global discriminative imaging patterns, then attention mechanisms are introduced to constrain the network toward tumor regions, and finally clinical variables are integrated to refine decision-making. This guidance strategy encourages prioritization of task-relevant features, improving identification of responders despite their limited representation in the dataset. Moreover, grounding attention in anatomically consistent tumor regions reduces reliance on dataset-specific patterns, thereby enhancing cross-institutional generalization. The framework is evaluated through external validation across heterogeneous MRI cohorts. Compared to non-guided single-stage baselines, the proposed approach improves sensitivity while maintaining competitive specificity, and produces anatomically coherent attention maps that support interpretation of the model's predictions. These findings highlight the potential of clinically-guided multimodal attention learning for robust and generalizable pCR prediction in breast cancer.
☆ Dynamic Mode Decomposition along Depth in Vision Transformers
Recent work has shown that contiguous vision transformer (ViT) blocks (a) can be replaced by a linear map and (b) organize into recurrent phases of computation. We ask whether these observations coincide: does ViT depth implement approximately \textit{autonomous linear} dynamics, admitting a single operator $K$ applied recurrently across a contiguous span? We test this using Dynamic Mode Decomposition (DMD), which fits $K$ from selected, consecutive hidden-state pairs and predicts $p$ steps ahead via $K^p$. On four pretrained DINO ViTs, we study the regularization, rank, and calibration budget required for stable fitting. For short spans ($p \leq 4$), $K^p$ tracks an unconstrained endpoint map to within $0.02$ cosine similarity on DINOv3-H/16+, while also recovering intermediate activations at each skipped block. At early cut starts, the fitted operators compress to rank $\ll d$ with minimal calibration data, and across tokens, \texttt{cls} is most amenable to linearization; both properties decay monotonically with depth. Yet this local fidelity does not transfer downstream. At the final hidden state, after propagating through the remaining blocks, an identity baseline becomes competitive.
☆ VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for spatial dependency extraction, VIMCAN achieves superior accuracy, with mean per-joint position errors (MPJPE) of 17.2 mm on TotalCapture and 45.3 mm on 3DPW. VIMCAN outperforms prior Transformer-based and other SOTA approaches while supporting real-time inference at over 60 frames per second on consumer-grade hardware. The source code is available on GitHub.
comment: 10 pages
☆ Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
3D vision systems are fundamentally constrained by their reliance on visual overlap: reconstruction methods require it for geometric alignment, while generative models use it to enforce multi-view consistency. This limitation is particularly acute in real-world scenarios such as distributed swarm robotics or crowd-sourced data collection, where capturing overlapping perspectives, both in terms of spatial and appearance overlap, is often impossible. We introduce Generative Reconstruction from Disjoint Views as a new paradigm, establish a comprehensive dataset, and propose specialized evaluation metrics for zero-overlap scenarios. Our benchmarking demonstrates that existing state-of-the-art methods fail catastrophically on this task, producing disconnected geometries or semantically incoherent reconstructions. To address these limitations, we propose GLADOS, a general, modular framework that operates through three stages: (1) Generative Bridging, where foundation models synthesize intermediate perspectives to connect disjoint inputs; (2) Robust Coarse 3D Reconstruction, that establish coarse geometric scaffold via global alignment which absorbs local contradictions from generative process; and (3) Iterative Context Expansion and Consistency Optimization to fill missing regions and unify the reconstruction. As an architectureagnostic framework, GLADOS enables seamless integration of future advances in generation, reconstruction, and inpainting. The source code is available at: https://github.com/gwilczynski95/GLADOS.
☆ Probabilistic Object Detection with Conformal Prediction
Conformal Prediction (CP) is a distribution-free method for constructing prediction sets with marginal finite-sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety-critical object detection. However, object detection introduces structured multi-output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed-width prediction intervals across inputs, leading to unnecessary width for low-uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input-dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi-class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate-wise to bounding box corners with a Bonferroni correction for box-level guarantees; (ii) scaling the resulting intervals using per-prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two-step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross-domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class-wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real-time, real-world object detection.
comment: Code is available at https://github.com/mos-ks/OD-CP
☆ Implicit Preference Alignment for Human Image Animation ICML 2026
Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at https://github.com/mdswyz/IPA
comment: Accepted by ICML 2026
☆ Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
World Action Models (WAMs) enable decision-making through imagined rollouts by predicting future observations and actions. However, the reliability of these imagined futures remains under-examined: is a generated future merely visually plausible, or is it dynamically compatible with the action sequence it claims to model? In this work, we identify action-state consistency, the alignment between predicted actions and induced state transitions, as a missing reliability axis for WAMs. Through a systematic study across representative joint-prediction and inverse-dynamics models, we find that action-state consistency systematically separates successful and failed rollouts across many tasks and follows similar success-failure trends as learned value estimates. These results suggest that consistency captures decision-relevant structure beyond visual realism. We further identify background collapse as an important boundary condition, where low-dynamics failed trajectories can become deceptively consistent because static futures are easier to predict. Building on these findings, we introduce a value-free consensus strategy for test-time selection, which ranks candidate rollouts by agreement among predicted futures. This strategy improves success rates on RoboCasa and RoboTwin 2.0 without additional training or reward modeling. Taken together, our findings establish action-state consistency as both a diagnostic tool for evaluating WAM reliability and a practical signal for value-free planning.
comment: Technical Report
☆ Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models
Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.
☆ InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce \textbf{InterLV-Search}, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
☆ Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.
☆ Cloud-top infrared observations reveal the four-dimensional precipitation structure
Accurate four-dimensional (4D) precipitation information is essential for understanding the Earth's energy and water cycles, yet remains observationally unresolved at global scales. Conventional theory holds that geostationary infrared observations primarily sense cloud-top properties, with limited sensitivity to sub-cloud precipitation. Here we show that cloud-top infrared measurements nevertheless encode sufficient information to recover the four-dimensional structure of precipitation, revealing a previously unexploited observability of sub-cloud processes. We introduce a physically constrained deep learning framework, 4DPrecipNet, in which a moisture-first constraint requires the latent representation to recover precipitable water vapour, anchoring the model in thermodynamic consistency. By integrating multi-channel infrared radiances with these constraints and radar-derived precipitation profiles, we reconstruct the vertical and temporal evolution of precipitation systems from geostationary orbit. The framework captures deep convective structures and their evolution, with robust performance across large samples and independent radar comparisons. These results demonstrate that sub-cloud precipitation is physically encoded in cloud-top infrared observations, establishing a new pathway for continuous global monitoring of precipitation structure.
☆ Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing CVPR
Unpaired smartphone ISP is a challenging problem due to the lack of scene and color alignment between RAW and target RGB images. Many existing methods either require paired data or rely heavily on adversarial training, which can become unstable in the unpaired setting. In this work, we present a simple and effective approach developed for the NTIRE 2026 Learned Smartphone ISP Challenge with Unpaired Data. Our method first reconstructs larger images from training patches to recover global context. Then, we extract semantic embeddings with DINOv2, and use fused Gromov-Wasserstein (FGW) optimal transport to build pseudo pairs between RAW and RGB images at both image and patch levels. This semantic matching allows us to partially alleviate the unpairedness of the data and build these pseudo input-target pairs. Based on these pseudo pairs, we train a lightweight CNN with only 7K parameters for color rendering. The network is designed to be compact and focus on color transformation rather than structural change, which helps reduce artifacts and improve training stability. Our challenge submission achieves 22.569 PSNR, 0.675 SSIM, and 8.067 $ΔE$ on the final hidden test set, significantly improving over the baseline and achieving the 3rd best SSIM and $ΔE$ among all challenge entries. Our code is available at github.com/nuniniyujin/Unpaired-ISP .
comment: 13 pages, 9 figures, CVPR Workshops 2026
☆ DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.
☆ How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings
The past year has seen over 20 open-source document parsing models, yet thefield still benchmarks almost exclusively on OmniDocBench, a 1,355-pagemanually annotated dataset whose top scores have saturated above 90%. Athree-stage audit pipeline we run on OmniDocBench screens its 21,353evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with overa year of public availability, both annotation quality and contamination riskcall its rankings into question. To address these issues, we presentPureDocBench, a programmatically generated, source-traceable benchmark thatrenders document images from HTML/CSS and produces verifiable annotations fromthe same source, covering 10 domains, 66 subcategories, and 1,475 pages, eachin three versions: clean, digitally degraded, and real-degraded (4,425 imagestotal). Evaluating 40 models spanning pipeline specialists, end-to-endspecialists, and general-purpose VLMs, we find: (i) document parsing is farfrom solved: the best model scores only ~74 out of 100, with a 44.6-point gapbetween the strongest and weakest models; (ii) specialist parsers with <=4Bparameters rival or surpass general VLMs that are 5-100x larger, yet formularecognition remains a shared bottleneck where no model exceeds 67% whenaveraging the formula metric across all three tracks; (iii) general VLMs loseonly 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21for pipeline specialists, producing ranking reversals that make clean-onlyevaluation misleading for deployment. All data, code, and artifacts arepublicly released.
comment: 42 pages, 20 figures, 16 tables
☆ Implicit Multi-Camera System Calibration Using Gaussian Processes
This paper proposes a novel framework for implicit multi-camera system calibration utilizing Gaussian Process (GP) regression. Conventional explicit calibration methods are constrained by rigid mathematical models and struggle with complex, non-linear distortions from unconventional optics, while existing neural network-based implicit approaches are typically data-hungry and lack inherent uncertainty quantification (UQ). Our GP-based model directly learns the complex, non-linear mapping from 2D image coordinates across all cameras to a 3D world coordinate, completely bypassing time-consuming estimation of explicit intrinsic and extrinsic parameters. Moreover, the inherent UQ is critical for transforming a simple 3D point prediction into a verifiable 3D measurement, complete with statistically-sound confidence bounds. To further enhance data efficiency and practical deployment, we integrate Active Learning (AL), which intelligently leverages the GP's predictive uncertainty to strategically guide the acquisition of new calibration data. This approach results in a robust, data-efficient, and reliable calibration solution, proving particularly effective in practical scenarios where collecting extensive calibration data is a dominant constraint. Our experiments show that the uncertainty for the 3D predictions is higher closer to the cameras. The data points in $uv$-coordinate space are more sparse in that region, even though they are not in 3D space. This work is relevant for anyone who is tasked with the calibration of complex multi-camera systems.
☆ AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics, validating the effectiveness of language-assisted and multimodal-prior-guided speech-driven facial animation.
☆ ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
Recent text-guided image editing (TIE) models have achieved remarkable progress, however, many edited results still suffer from artifacts, unintended modifications, and suboptimal aesthetics. Although several benchmarks and evaluation methods have been proposed, most existing approaches rely on scalar scores and lack interpretability. This limitation largely stems from the absence of high-quality interpretation datasets for TIE and effective reward models to train interpretable evaluators. To address these challenges, we introduce ReasonEdit-22K, the first dataset that combines 22K edited images with 113K Chain-of-Thought (CoT) samples, along with 1.3M human judgments assessing these interpretations in terms of logicality, accuracy, and usefulness. Building upon this dataset, we propose RE-Reward, a multimodal large language model (MLLM)-based reward model designed to provide human-aligned feedback for evaluating interpretable reasoning in image editing. Furthermore, we develop ReasonEdit, which is trained using reward signals derived from RE-Reward and the Group Relative Policy Optimization (GRPO) algorithm to learn an interpretable evaluation model. Extensive experiments demonstrate that ReasonEdit achieves superior alignment with human preferences and exhibits strong generalization across public benchmarks. In addition, it is capable of generating high-quality interpretable evaluation text, enabling more transparent and trustworthy assessment for image editing. The code is available at https://github.com/IntMeGroup/ReasonEdit.
☆ ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
comment: 26 pages
☆ A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images
Non-alcoholic fatty pancreas disease (NAFPD) is an underdiagnosed condition associated with metabolic syndrome, insulin resistance, and increased risk of pancreatic cancer. Diagnosis typically relies on subjective visual assessment of ultrasound images by clinicians. We propose an end-to-end framework for automatically classifying normal versus fatty pancreas from abdominal ultrasound images. Our method employs a TransUNet-based segmentation architecture with a ResNet encoder and transformer bottleneck to delineate the pancreas and the splenic vein, followed by anatomically-guided patch extraction and patient-level classification through pairwise texture comparison. The feature engineering mimics clinical reasoning by comparing the echogenicity of peri-venous fat to the pancreatic parenchyma, providing an interpretable signal for classification. The segmentation models are initialized via domain-specific transfer learning from a liver segmentation task. We validate the full pipeline on a clinical dataset of 214 abdominal ultrasound images with 107 expert-labeled cases using 5-fold cross-validation. SVM with RBF kernel achieves a mean cross-validated accuracy of 89.7\%\,$\pm$\,1.8\% and F1 of 0.898\,$\pm$\,0.019, while the unsupervised K-Means baseline reaches 87.8\% accuracy, demonstrating that the proposed features capture the relevant clinical signal even without labeled training data. To our knowledge, this is the first end-to-end automated framework for fatty pancreas classification from ultrasound using segmentation-guided texture analysis.
☆ EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnose accuracy and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.
☆ EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing
Visual-prompt-guided edit transfer aims to learn image transformations directly from example pairs, offering more precise and controllable editing than purely text-driven approaches. However, existing diffusion transformer-based methods often fail to faithfully reproduce the demonstrated edits due to structural mismatches between the task and the backbone, including a pretrained bias toward textual conditioning and inherent stochastic instability during sampling. To bridge this gap, we present EditTransfer++, a framework that combines progressively structured training with an efficient conditioning scheme to improve both visual prompt faithfulness and inference efficiency. We first mitigate textual dominance with a text-decoupled training strategy that removes text conditioning during fine-tuning, compelling the model to infer transformations solely from visual evidence while still supporting optional text guidance at inference. On top of this visually grounded model, a best-worst contrastive refinement mechanism reshapes the denoising trajectories to suppress unfaithful generations and improve consistency across random seeds. To alleviate the computational bottleneck of high-resolution in-context editing, we further introduce a condition compression and reuse strategy that reduces token redundancy and enables efficient generation of images with a 1024-pixel long edge. Extensive experiments on existing benchmarks and the proposed EditTransfer-Bench show that EditTransfer++ achieves state-of-the-art visual prompt faithfulness with substantially faster inference than prior methods, suggesting a promising direction for scalable prompt-guided image editing and broader visual in-context learning.
☆ Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
☆ Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework
Existing mobile devices are constrained by compact optical designs, such as small apertures, which make it difficult to produce natural, optically realistic bokeh effects. Although recent learning-based methods have shown promising results, they still struggle with photos captured under high digital zoom levels, which often suffer from reduced resolution and loss of fine details. A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. Through an alternative training strategy and a focus-aware masked attention mechanism, our method jointly optimizes bokeh rendering and super-resolution, substantially improving both controllability and visual fidelity. Furthermore, we introduce degradation-aware depth module to enable more accurate depth estimation from low-quality inputs. Experimental results demonstrate that MagicBokeh efficiently produces photorealistic bokeh effects, particularly on real-world low-resolution images, paving the way for future advancements in bokeh rendering. Our code and models are available at https://github.com/vivoCameraResearch/MagicBokeh.
☆ SR$^2$-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning
Pre-trained models with parameter-efficient fine-tuning (PEFT) have demonstrated promising potential for class-incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter-layer relation drift, i.e., the progressive disruption of relationships among layer-wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose \underline{S}elf-\underline{R}ectifying inter-layer \underline{R}elation Low-Rank Adaptation~(SR$^2$-LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter-layer relation drift. Specifically, SR$^2$-LoRA constructs the relation matrices induced by the previous and current models on current-task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry-wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR$^2$-LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available in the \href{https://github.com/FqWan24/SR-2-LoRA}{repository}.
☆ Learning Image-Adaptive Scale Fields for Metric Depth Recovery
Monocular depth estimation (MDE) typically produces depth estimations that are defined up to an unknown scale or shift. When only sparse metric anchors are available, recovering accurate metric depth becomes challenging yet necessary for practical applications. We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations. Extensive experiments across multiple datasets and representative MDE models demonstrate the effectiveness and general applicability of our approach.
☆ ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements (2) they mostly assume a single and two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
☆ InsHuman: Towards Natural and Identity-Preserving Human Insertion
Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human pose in new background, inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person's pose, count, and overall appearance are coherently adapted to the target background.We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each face.In addition, we propose Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.
☆ GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.
☆ Exposing and Mitigating Temporal Attack in Deepfake Video Detection
While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.
☆ BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
☆ ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
Generative models have achieved success in producing apparently coherent 2D videos, but remain challenging in the physical world due to lack of 4D spatiotemporal scale. Typically, existing 4D generative models directly embed macro scale constraints to enhance overall spatiotemporal consistency. However, these methods only ensure global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We sculpture these representations into global appearance graph and local dynamic graph, and fuse them via semantic-bridged spatiotemporal fusion to obtain a 4D cognition graph. 3) Spatiotemporal reasoning. We utilize a world model to derive future state based on the 4D cognition. 4) Spatiotemporal generation. We leverage the derived cognition as condition to guide latent diffusion for 4D Gaussian generation. By deeply integrating 4D intrinsic cognition with generative priors, our model guarantees the structural rationality and topological consistency of 4D generation. Moreover, we propose ST-4D datasets by aggregating public 4D datasets and self-built subset. Extensive experiments demonstrate the superiority of our ST-Gen4D across 3D and 4D generation tasks.
☆ A Marine Debris Detection Framework for Ocean Robots via Self-Attention Enhancement and Feature Interaction Optimization
Marine debris detection for ocean robot is crucial for ecological protection, yet performance is often degraded by low-quality images with blur, complex backgrounds, and small targets. To address these challenges, we propose YOLO-MD, an enhanced YOLO-based detection framework. A Dual-Branch Convolutional Enhanced Self-Attention (DB-CASA) module is designed to strengthen spatial-channel interactions, improving feature representation in degraded images. Additionally, a lightweight shift-based operation is introduced to enhance fine-grained feature extraction for objects of varying scales while maintaining parameter efficiency. We further propose SFG-Loss to mitigate class imbalance and optimization instability via dynamic sample reweighting. Experiments on the UODM dataset demonstrate that YOLO-MD achieves 0.875 precision, 0.822 F1-score, and 0.849 mAP50, outperforming the latest state-of-the-art methods. The effectiveness of this method has also been verified through real-world robotic edge deployment experiments.
☆ Velocity-Space 3D Asset Editing
Editing a 3D asset locally, modifying a target region while preserving the rest, is a fundamental requirement of native 3D editing. Existing methods enforce locality through mechanisms external to the generator, such as manual 3D masks, post-hoc voxel merging, or 2D multi-view lifting. None of them intervene where the corruption actually originates: inside the ODE sampler. For a rectified-flow generator to achieve faithful local editing, its velocity field should be strong over the target editing region while vanishing on preserved content. Yet a single velocity field can hardly satisfy both requirements simultaneously, leading to three problems: (i) identity leakage that keeps the edit signal non-zero on preserved regions; (ii) no dedicated edit-amplification channel, so strengthening the edit inevitably perturbs identity; and (iii) an identity drag at the geometry and material stages, where a global condition pulls every token toward the target. We propose VS3D (Velocity-Space 3D Asset editing}), an inversion-free, training-free, and mask-free framework that addresses each problem with a targeted intervention inside the sampler. VS3D integrates three complementary modules, each corresponding to a specific stage of the editing pipeline. Reconstruction-Anchored Source Injection (RASI) absorbs identity leakage by turning the unconditional embedding into a per-step, asset-specific anchor calibrated through source reconstruction. Partial-Mean Guidance (PMG) amplifies the edit signal by contrasting high- and low-quality subsample estimates of the velocity difference, active only where a consistent edit exists. Twin-Agreement Residual injection (TAR) lets the sampler decide token by token what to preserve at the geometry and material stages.
☆ RELO: Reinforcement Learning to Localize for Visual Object Tracking ICML 2026
Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
comment: ICML 2026 paper
☆ Weather-Robust Scene Semantics with Vision-Aligned 4D Radar ICRA 2026
Cameras and LiDAR degrade in rain, fog, and snow, while millimeter-wave radar remains largely unaffected. We align a radar encoder to frozen SigLIP vision embeddings and decode structured scene captions through a frozen vision-language model (VLM) with approximately 7M trainable parameters. On K-RADAR with held-out fog, light snow, and heavy snow sequences, all radar configurations outperform a camera baseline that collapses to over 90% hallucination. We identify a token-norm mismatch as the dominant failure mode when bridging radar to a frozen VLM and show that projector-output LayerNorm resolves it. Analysis of encoder complexity, caption format, and pooling strategy reveals tradeoffs that inform future radar-VLM pipeline design.
comment: 5 pages + references, 2 appendix pages. ICRA 2026 Radar in Robotics Workshop
☆ UniISP: A Unified ISP Framework for Both Human and Machine Vision
Compared to RGB images, raw sensor data provides a richer representation of information, which is crucial for accurate recognition, particularly under challenging conditions such as low-light environments. The traditional Image Signal Processing (ISP) pipeline generates visually pleasing RGB images for human perception through a series of steps, but some of these operations may adversely impact the information integrity by introducing compression and loss. Furthermore, in computer vision tasks that directly utilize raw camera data, most existing methods integrate minimal ISP processing with downstream networks, yet the resulting images are often difficult to visualize or do not align with human aesthetic preferences. This paper proposes UniISP, a novel ISP framework designed to simultaneously meet the requirements of both human visual perception and computer vision applications. By incorporating a carefully designed Hybrid Attention Module (HAM) and employing supervised learning, the proposed method ensures that the generated images are visually appealing. Additionally, a Feature Adapter module is introduced to effectively propagate informative features from the ISP stage to subsequent downstream networks. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various scenarios and multiple datasets, proving its generalizability and effectiveness.
☆ UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition
Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.
☆ TTF: Temporal Token Fusion for Efficient Video-Language Model NeurIPS 2026
Video-language models (VLMs) face rapid inference costs as visual token counts scale with video length. For example, 32 frames at $448{\times}448$ resolution already yield >8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring offsets to their gains. We propose \textbf{Temporal Token Fusion (TTF)}, a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g.,$3\times 3$), fusing tokens that exceed a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67\% of visual tokens while retaining 99.5\% of the baseline accuracy and introducing only ${\approx}0.16$\,GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at \href{https://github.com/Cominder/ttf}{https://github.com/Cominder/ttf}
comment: 14 pages; manuscript submitted to NeurIPS 2026
☆ Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
The expanding application of smart sensing has created a growing demand for the accurate understanding of human action at the network edge. Traditional approaches require massive video data to be transmitted from resource-constrained edge devices to powerful cloud servers, incurring prohibitive uplink bandwidth consumption and unacceptable latency while raising privacy concerns. To overcome these bottlenecks, we propose a task-oriented communication framework for human action understanding (TOAU) through edge-cloud collaboration. Our framework utilizes a monocular pose estimator to extract continuous joint coordinates from raw videos, followed by a vector quantized variational autoencoder (VQ-VAE) to convert these coordinates into discrete motion tokens. Consequently, only a compact sequence of codebook indices is transmitted over the network, consuming as few as 9 bits per frame and avoiding privacy leakages. At the cloud server, a lightweight projector aligns these motion tokens with the embedding space of a large vision-language model (VLM) to facilitate complex action understanding, which is trained with an efficient instruction tuning paradigm. Comprehensive evaluations on three benchmarks demonstrate that our TOAU system reduces the transmission payload to approximately 1\% and the system latency to around 20\% compared to video codec-based solutions, while delivering comparable action understanding accuracy.
comment: 12 pages, 6 figures
☆ Disambiguating 2D-3D Correspondences in Gaussian Splatting-based Feature Fields for Visual Localization
While Gaussian Splatting-based Feature Fields (GSFFs) have shown promise for visual localization, this paper highlights that photometrically optimized GSFFs are inherently ill-suited for 2D-3D matching. The volumetric extent of each Gaussian induces many-to-one pixel-to-point mappings that destabilize PnP-based pose estimation, while photometric optimization gives rise to superfluous Gaussians devoid of multi-view consistency. To address these issues, we propose SplitGS-Loc, a localization-specialized GSFFs construction framework that disambiguates 2D-3D correspondences by exploiting Gaussian attributes. Our key design, Mixture-of-Gaussians-based splitting, decomposes each Gaussian into smaller Gaussians, replacing ambiguous many-to-one with precise one-to-one correspondences. In parallel, we exploit composition weights from GS rasterization to select Gaussians that significantly and consistently contribute across multiple views and aggregate discriminative features through strong pixel-Gaussian associations, enforcing multi-view consistency. The resulting compact yet discriminative feature fields enable stable PnP convergence, achieving state-of-the-art performance on localization benchmarks. Extensive experiments validate that SplitGS-Loc extends the utility of photometric GSFFs to accurate and efficient localization by exploiting Gaussian attributes, without per-scene training or iterative pose refinement.
☆ SoLAR: Error-Resilient Streamable Long-Horizon Free-Viewpoint Video Reconstruction with Anchor Activation and Latent Recalibration
Free-Viewpoint Video (FVV) has emerged as a cornerstone of next-generation immersive media systems and attracted widespread attention. Previous methods primarily focus on short video sequences and suffer from significant performance degradation when processing long-horizon free-viewpoint video (LFVV). Motivated by bit allocation theory, we analyze dynamic-anchor-based volumetric video representation within a rate-distortion optimization framework and propose \textbf{SoLAR}, which is the first error-resilient streamable FVV framework that maintains stable reconstruction quality on long sequences without requiring group-of-pictures partitioning. We propose the Anchor Activation Dynamics (AAD), which enables dynamic anchors to model non-rigid transformations by dynamically activating informative anchors and suppressing redundant ones. Furthermore, we introduce Latent Discrepancy Aware Recalibration (LaDAR), which is a mechanism to identify discrepancies between latent representations and recalibrate the correspondences encoded in the network, effectively mitigating error propagation in LFVV without compromising real-time performance or storage compactness. Extensive experiments demonstrate that \textbf{SoLAR} achieves state-of-the-art reconstruction performance while maintaining minimum storage overhead, which provides a new direction for LFVV reconstruction and advances the practical deployment of immersive systems. Demo free-viewpoint videos are provided in the supplementary material.
☆ ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.
☆ RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.
comment: 21 pages
☆ GC-ART: Global Learnable Second-Order Rational Tone Curves for Illumination Robustness
We introduce GC-ART (Global Curve Adaptive Rational Tone-mapping), a lightweight differentiable pre-processing module for robust image classification. GC-ART predicts an endpoint-pinned rational tone curve from per-channel soft histograms using a 643-parameter MLP, then applies the curve pointwise before the classifier. The module is trained end-to-end with cross-entropy and a soft monotonicity penalty. On CIFAR-10 with a CIFAR-style ResNet-18, GC-ART matches clean accuracy with the unenhanced baseline and other learned enhancers, improves over the baseline on multiplicative darkening, and achieves the best learned-method result on contrast corruption (48.45% vs. 46.27% for the baseline and 47.13% for Zero-DCE++). These results suggest that histogram-conditioned rational curves can learn useful global tone corrections, including contrast-expanding behavior, while preserving edge locations by construction through pointwise mapping. GC-ART also uses substantially fewer FLOPs than convolutional learned enhancers at 32 x 32. The current hyperparameters are untuned, leaving room for systematic improvement.
☆ Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations
Sampling from pretrained diffusion and flow-matching models typically requires many forward passes to generate diverse and high-fidelity images. Existing distillation methods often rely on multiple auxiliary networks, carefully designed training stages, or complex optimization pipelines. In this work, we revisit the recently proposed Drifting Model objective and show that a single drifting loss can be directly used to simplify one step distillation. A key observation is that the pretrained diffusion teacher itself already provides a strong representation space. Unlike the original Drifting Model, which relies on an additional pretrained feature extractor, we use intermediate hidden states of the pretrained teacher model as the feature representation. This removes the need for training or introducing an extra representation network while preserving a semantically meaningful feature geometry for drifting. Furthermore, we introduce a lightweight mode coverage loss to mitigate mode collapse during distillation and encourage the student generator to cover diverse teacher-supported regions. Extensive experiments on ImageNet and SDXL demonstrate that our method achieves efficient one step generation with competitive image quality and diversity, achieving FID scores of 1.58 on ImageNet-64$\times$64 and 18.4 on SDXL, while substantially simplifying the overall distillation framework.
☆ GEM: Generating LiDAR World Model via Deformable Mamba
World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages deformable mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate ``what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performances across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.
☆ Amortized-Precision Quantization for Early-Exit Vision Transformers
Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20\% across classification, detection, and segmentation tasks.
☆ EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams
Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI).In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set.Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains.Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations.Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance.The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.
comment: 8 pages
☆ Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
☆ SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis
Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive yet compact 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency structural cues, the routing process is encouraged to assign more Gaussian primitives to fine structures, complex geometry, and textured regions, while suppressing redundant primitives in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives.
☆ Predictive but Not Plannable: RC-aux for Latent World Models
A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.
☆ From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG
Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71\% to 43.29\%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.
☆ Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.
☆ Adaptive Subspace Projection for Generative Personalization
Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.
☆ TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts CVPR 2026
Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRAExperts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.
comment: Accepted to CVPR 2026
☆ High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images
Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.
comment: 19 pages, 9 figures
☆ LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700$\times$, model parameters by 25-75$\times$, and inference-time overhead by 10-20$\times$ compared to prior methods.
comment: 27 pages, 7 figures
☆ PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE based co-speech gesture generation methods improve generation quality but fail to encode semantic structure into the motion representation or explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference prompt. Our project page with demo videos is available at https://danny-nus.github.io/PersonaGest/
comment: 26 pages, 10 figures, 12 tables
☆ Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment ACL 2026
Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to ``Cognitive Overload'', hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple ``Structured Cognitive Offloading'' strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.
comment: Accepted to Findings of ACL 2026
☆ Towards multi-modal forgery representation learning for AI-generated video detection and localization
Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including partially manipulated clips across visual and audio channels, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality of data modeling and the lack of fine-grained temporal forgery localization. To address these challenges, our primary novelty introduces a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.
☆ CASCADE: Context-Aware Relaxation for Speculative Image Decoding
Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model's high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model's behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model's hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.
☆ DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation
Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
☆ LoHGNet: Infrared Small Target Detection through Lorentz Geometric Encoding with High-Order Relation Learning
Infrared small target detection (IRSTD) remains challenging due to the scarcity of useful target cues and the presence of severe background clutter. Most current methods rely on conventional feature learning and local interaction modeling, where features are represented in Euclidean space. However, such designs may still be limited in describing the subtle differences of weak targets and the contextual relations between targets and backgrounds. To address these limitations, we propose LoHGNet, an IRSTD network that integrates Lorentz geometric encoding with high-order relation learning. By introducing Lorentz manifold based feature learning, LoHGNet offers a different feature representation from conventional IRSTD methods and provides new discriminative cues for IRSTD. Specifically, a Lorentz encoding branch is constructed with the Geometric Attention Guided Lorentz Residual Convolution Module (GA-LRCM) to perform feature modeling under hyperbolic geometric constraints and enhance the hierarchical geometric representation capability of weak targets. Subsequently, the hyperbolic features are mapped into the Euclidean tangent space through logarithmic mapping, and a High-Order Relation Learning Module (HORL) is designed to model the high-order contextual dependencies between targets and backgrounds via hypergraph construction, thereby improving target discrimination in complex backgrounds. Experimental results on three datasets demonstrate that the proposed LoHGNet achieves competitive performance in both detection accuracy and adaptability to complex scenes. The code will be available at https://github.com/Kingwin97.
☆ From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting
Scene change detection methods built on Gaussian splatting universally follow a render-then-compare paradigm: the pre-change scene is rendered into 2D and compared against post-change images via pixel or feature residuals. This change detection problem with Gaussian Splatting has been treated as a question about pixels; we treat it as a question about primitives. We provide direct evidence that native primitive attributes alone -- position, anisotropic covariance, and color -- carry sufficient signal for scene change detection. What makes primitive-space comparison hard is the under-constrained nature of Gaussian splatting representation: independent optimizations yield primitive solutions whose count, positions, shapes, and colors differ even where nothing has changed. We address this challenge with anisotropic models of geometric and photometric drift, complemented by a per-primitive observability term that reflects the extent to which each Gaussian is constrained by the camera geometry. Operating directly on primitives gives our method, GD-DIFF, two properties that distinguish it from render-then-compare methods. First, change maps are multi-view consistent by construction, where prior work had to learn this through an additional optimization objective. Second, geometric and appearance changes are scored separately, identifying not just where but what kind of change occurred, distinguishing structural changes (e.g., an added object) from surface-level ones (e.g., a color change) without supervision or external model dependencies. On real-world benchmarks, GS-DIFF surpasses the prior state-of-the-art approach by approximatelt 17% in mean Intersection over Union.
comment: Project Page: https://chumsy0725.github.io/GS-DIFF/
☆ See Tomorrow, Act Today: Foresight-Driven Autonomous Driving CVPR
Current end-to-end autonomous driving planners are fundamentally reactive: they condition on historical and present observations to predict future actions. We argue that autonomous agents should instead imagine future scenes before deciding, just as human drivers mentally simulate ``what will happen next" before acting. We introduce ForeSight, a foundation world model centric planning framework that reframes autonomous driving as anticipatory decision-making. Rather than treating world models as auxiliary components, ForeSight makes future scene imagination the primary driver of action prediction. Our approach operates in two stages: (1) generating plausible future visual worlds via a pretrained world model, and (2) planning actions conditioned on these imagined futures. This paradigm shift from ``what should I do now?" to ``what will happen, and how should I respond?" enables genuinely anticipatory rather than reactive planning. By grounding decisions in anticipated contexts rather than present observations alone, ForeSight navigates dynamic, interactive scenarios more effectively. Extensive experiments on NAVSIM and nuScenes demonstrate that explicit future imagination significantly outperforms previous state-of-the-art alternatives, validating our foresight-driven approach.
comment: CVPR Findings 2026
☆ Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models
Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.
☆ AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes
3D reconstruction methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) achieve impressive photorealism but fail when input images suffer from severe motion blur. While event cameras provide high-temporal-resolution motion cues, existing event-assisted approaches rely on low-resolution sensors and strict synchronization, limiting their practicality for handheld 3D capture on common devices, such as smartphones. We introduce a flexible, high-resolution asynchronous RGB-Event dual-camera system and a corresponding reconstruction framework. Our approach first reconstructs sharp images from the event data and then employs a cross-domain pose estimation module based on the Visual Geometry Transformer (VGGT) to obtain robust initialization for 3DGS. During optimization, we employ a structure-driven event loss and view-specific consistency regularizers to mitigate the ill-posed behavior of traditional event losses and deblurring losses, ensuring both stable and high-fidelity reconstruction. We further contribute AsyncEv-Deblur, a new high-resolution RGB-Event dataset captured with our asynchronous system. Experiments demonstrate that our method achieves state-of-the-art performance on both our challenging dataset and existing benchmarks, substantially improving reconstruction robustness under severe motion blur. Project page: https://openimaginglab.github.io/AsyncEvGS/
☆ Attention Transfer Is Not Universally Effective for Vision Transformers
A recent work shows that Attention Transfer, which transfers only the attention patterns from a pre-trained teacher Vision Transformer (ViT) to a randomly initialized standard student ViT, is sufficient to recover the full benefit of the teacher's pre-trained weights. We revisit this finding on a comprehensive benchmark of 20 teachers from 11 well-known ViT families and reveal that Attention Transfer is not universally effective. While 7 families transfer successfully, 4 consistently fail, falling up to 5.1\% below the from-scratch no-transfer baseline. Further results demonstrate that this failure is family-consistent across model sizes, and persists under extended training durations, different transfer datasets, and out-of-distribution evaluations. Controlled analyses then consistently localize the problem to the attention-routing channel, indicating that the key issue is not whether the student can match the teacher's attention patterns, but whether the matched patterns remain functional for the student. Crucially, we identify architectural mismatch between the pre-trained teacher and the standard student as the primary mechanism. By adding only the teacher's native architectural components to the student in a randomly initialized state, we completely reverse the failure for all 4 families. Notably, these components alone do not improve from-scratch training, confirming that they specifically unlock the usability of the teacher's attention. We further systematically show that this failure is not explained by the inadequate choice of transfer loss or by differences in pre-training recipes. Our findings refine the prevailing understanding of attention in ViT representations: attention is sufficient \textit{only} when the student architecture matches the teacher.
☆ PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-ofthe-art performance, consistently outperforming both academic and industrial gaze tracking methods across nocalibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-toend paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.
comment: 15 pages, 10 figures, conference
☆ SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction
Sparse-view satellite image surface reconstruction remains highly challenging, fundamentally because the reliability of multi-view matching under satellite imaging conditions is strongly spatially heterogeneous. Affected by large photometric differences, weak textures, and repetitive textures, multi-view geometric constraints are often sparse, unevenly distributed, and locally unreliable. Although 2D Gaussian Splatting (2DGS) is more suitable than 3D Gaussian Splatting (3DGS) for the explicit representation of continuous surfaces, research on generalizable feed-forward 2DGS frameworks for sparse-view satellite surface reconstruction is still lacking. To address this issue, we propose SatSurfGS, a generalizable sparse-view surface reconstruction method for satellite imagery based on 2DGS. The proposed method builds a coarse-to-fine Gaussian attribute prediction framework and explicitly models local geometric reliability at three levels: feature learning, Gaussian parameter estimation, and training optimization. Specifically, we propose a confidence-aware monocular multi-view feature fusion module to adaptively integrate monocular priors and multi-view matching features according to local confidence; a cross-stage self-consistency residual guidance module to stabilize stage-wise Gaussian parameter refinement using the residual between the rendered height map from the previous stage and the current-stage MVS height map, together with confidence information; and a confidence bidirectional routing loss to achieve differentiated allocation of geometric and appearance supervision. Experiments on satellite datasets show that the proposed method achieves improved rendering quality, surface reconstruction accuracy, cross-dataset generalization, and inference efficiency compared with representative generalizable baselines and competitive per-scene optimization methods.
☆ Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection
Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopts a two-stage training strategy to fine-tune on remote sensing imagery firstly for robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80\% and an F$_{\text{scd}}$ of 66.14\%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.
☆ Hierarchical Perfusion Graphs for Tumor Heterogeneity Modeling in Glioma Molecular Subtyping MICCAI 2026
Precise molecular subtyping of gliomas, including isocitrate dehydrogenase (IDH) mutation and 1p/19q codeletion, directly guides surgical and therapeutic decisions, yet currently relies on invasive tissue sampling. Deep learning on structural MRI has emerged as a non-invasive alternative, but anatomy-only approaches cannot capture the hemodynamic signatures that distinguish molecular subtypes. Radiogenomics based on dynamic susceptibility contrast (DSC) MRI holds immense potential for non-invasively characterizing glioma molecular subtypes, yet clinical deployment has been hindered by inter-site variability and the limitations of voxel-wise analysis. We introduce HiPerfGNN, a framework that first learns discrete hemodynamic representations from raw time-intensity curves using a vector-quantized variational autoencoder (VQ-VAE). These quantized perfusion codes define coarse-level graph nodes representing functional tumor habitats, each of which is hierarchically subdivided into fine-level subregions guided by structural MRI. A hierarchical graph neural network then propagates information across scales for molecular prediction. On an internal cohort (n=475), the model achieved AUCs of 0.96 (IDH), 0.89 (1p/19q), and 0.84 (WHO grade), and maintained robust IDH performance (AUC 0.89) on an independent external cohort (n=397) without recalibration. Gradient-based saliency analysis confirms biologically grounded attention patterns aligned with known glioma pathophysiology. Our results demonstrate the added value of integrating perfusion dynamics into radiogenomic pipelines for glioma molecular subtyping. Code is available at https://github.com/janghana/HiPerfGNN.
comment: Accepted at MICCAI 2026. 11 pages, 2 figures, 2 tables
☆ PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.
comment: 11 pages, 8 figures
☆ DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection
Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. The multi-temporal cross-modal input consisting of pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging as imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with DSM. A gated fusion mechanism then selectively injects geometric cues from depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.
☆ Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection CVPR 2025
Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.
comment: Accepted to CVPR 2025. 15 pages
☆ Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. By isolating this spatial subspace through cross-scene linear feature extraction, we extract a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1\% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping
☆ UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection
Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential "enhance-then-detect" paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.
☆ Fine-tuning a vision-language model for fracture-surface morphology recognition
Vision-language models (VLMs) have shown strong potential for scientific image understanding, but general-purpose models often lack the domain-specific visual knowledge required for reliable materials characterization. In this work, we fine-tuned an open-source VLM (Qwen3-VL-32B-Instruct) for fracture-surface image analysis using a curated dataset of 13,168 open-source, literature-mined fracture-surface images. Morphology annotations were generated by GPT-5.2-Reasoning (high) from both the images and relevant excerpts of their source papers, and the dataset was further enriched with targeted manual collection and rotation-based augmentation. The resulting specialist model outperforms flagship proprietary multimodal models on a benchmark of 100 manually annotated images. It achieves a precision of 0.92, compared to 0.35 for the base Qwen3-VL-32B-Instruct, 0.58 for GPT-5.5-Reasoning (high), and 0.78 for Gemini 3.1 Pro-Reasoning (high). Dataset ablations show that manual collection of rare-feature images and augmentation via image rotation are both beneficial to improve recognition of less common fracture morphology features. We further discuss integrated use of the fine-tuned model with proprietary models to combine fracture-specific visual accuracy with broader multimodal reasoning for autonomous fractography. Although focused on fracture-surface images, this work demonstrates how VLMs can be adapted through targeted collection and fine-tuning on novel feature images to recognize those features and support downstream decision-making in autonomous microscopy workflows.
☆ TriP: A Triangle Puzzle Approach to Robust Translation Averaging
Translation averaging aims to recover camera locations from pairwise relative translation directions and is a fundamental component of global Structure-from-Motion pipelines. The problem is challenging because direction measurements contain no distance information, making the estimation problem highly ill-conditioned and highly sensitive to corrupted observations. In this paper, we propose TriP, a triangle-based framework for robust translation averaging. TriP first infers local relative edge scales from triangle geometry, and then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By leveraging higher-order consistency across triangles, the proposed method is robust to adversarial, cycle-consistent, and other structured corruptions. In addition, TriP avoids the collapse issue without requiring any extra anti-collapse constraints, since log-scale synchronization excludes the degenerate zero-scale solution by construction. These structural advantages enable a particularly strong theory for exact location recovery. On the practical side, TriP is fully parallelizable, computationally efficient, and naturally scalable to graphs with millions of cameras. Moreover, it outperforms all previous translation averaging methods by a large margin on both synthetic and real datasets.
☆ AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification CVPR
Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.
comment: CVPR CV4CLINIC 2026
☆ Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4\% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
☆ Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition IJCAI 2026
Skeleton-based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON
comment: Accepted In Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
☆ InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization
Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. While existing approaches rely on global feature alignment, they often suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.
☆ Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information
Channel importance in vision networks is usually summarized by a single score. That summary hides two different questions: how much a channel is related to the task, and whether its function can be supplied by same-layer peers when the channel is removed. We call the second property local replaceability. We introduce a two-axis view that separates these questions. The local axis measures input capture and peer overlap, while the target axis measures task information and target-excess information. Across ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100, the two axes are weakly aligned, induce different channel groupings, and separate rapidly during training despite being strongly coupled at random initialization. A Gaussian linear analysis accounts for how this separation can arise through residualized gradient directions, and lesion plus peer-replacement experiments show that peer support refines removability beyond input capture and task relevance alone. Under the fixed FLOPs-matched pruning protocol, local-axis metrics are more reliable predictors of removability than target-axis metrics across the three CIFAR-100 backbones, with the same direction preserved in stress tests on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot. These findings identify an axis-level distinction rather than a universal ranking of pruning scores: local replaceability is a more reliable guide to removability than target relevance, while norm-based baselines remain competitive in architectures such as VGG-16. Relevance-based scores ask what a channel says about the task; pruning asks whether the network still needs that channel when its peers remain available.
☆ ImplantMamba: Long-range Sequential Modeling Mamba For Dental Implant Position Prediction
In the design of surgical guides for implant placement, determining the precise implant position is a critical step. However, the implant region itself is often characterized by a lack of distinctive texture in medical images. Consequently, artificial intelligence (AI) models must infer the correct implant position and angulation (slope) primarily by analyzing the texture of the surrounding teeth, which poses a significant challenge. To address this, we propose ImplantMamba, a network architecture designed for long-range sequential modeling to integrate texture information from adjacent teeth. Our approach explicitly couples the regression of the implant position with its slope. The core of ImplantMamba is a hybrid encoder that combines Convolutional Neural Networks (CNNs) with Mamba layers. This design enables the network to hierarchically extract local anatomical features through CNNs while simultaneously modeling global contextual dependencies across the entire scan volume via Mamba's selective scan operations, leading to a more comprehensive understanding of the implant site. Furthermore, we introduce a Slope-Coupled Prediction Branch (SCP). This branch is designed to connect the prediction of implant position with the slope, ensuring internal consistency and anatomical plausibility by thereby enforcing a coherent relationship between the predicted implant location and its angulation. Extensive experiments on a large-scale dental implant dataset demonstrate that the proposed ImplantMamba achieves superior performance compared to existing methods.
☆ Learning Visual Feature-Based World Models via Residual Latent Action
World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
☆ Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
Detecting AI-generated images across unseen architectures remains challenging, as existing models often overfit to generator-specific fingerprints and semantic content rather than learning universal forgery traces. We attribute this failure to feature entanglement: detectors learn these factors as a single entangled representation, where universal forgery traces are inextricably confounded with both generator-specific fingerprints and semantic content. Crucially, our spectral analysis reveals that this entanglement is avoidable: distinct generator-specific fingerprints (e.g., GAN stripes vs. Diffusion Model spots) occupy disjoint frequency subspaces and coexist as independent superpositions. Leveraging this physical orthogonality, we propose the Orthogonal Decomposition and Purification Network (ODP-Net) to structurally disentangle these factors. Specifically, ODP-Net employs (1) Instance-aware Orthogonal Decomposition to project features into mutually exclusive subspaces: universal forgery traces, generator-specific fingerprints, and semantic content; (2) Perturbation-based Purification to enforce semantic invariance via cross-sample feature injection; and (3) Manifold Alignment to bridge domain gaps. By explicitly decoupling universal forgery traces from generator-specific fingerprints and semantic content, ODP-Net achieves state-of-the-art performance on unseen architectures (e.g., Stable Diffusion 3), validating that structural disentanglement is key to generalization.
comment: ~10 pages (IEEEtran two-column), 6 figures, 6 tables, 1 algorithm
☆ Learning to Track Instance from Single Nature Language Description CVPR 2026
How to achieve vision-language (VL) tracking using natural language descriptions from a video sequence \textbf{without relying on any bounding-box ground truth}? In this work, we achieve this goal by tackling \textit{self-supervised VL tracking}, which aims to evaluate tracking capabilities guided by natural language descriptions. We introduce \textbf{\tracker}, a novel self-supervised VL tracker that is capable of tracking any referred object by a language description. Unlike traditional methods that equally fuse all language and visual tokens, we propose an efficient Dynamic Token Aggregation Module, which treats each visual token \textbf{unequally}. The module consists of three main steps: i) Based on an anchor token, it selects multiple important target tokens from the template frame. ii) The selected target tokens are merged according to their attention scores and aggregated into the language tokens, thereby eliminating redundant visual token noise and enhancing semantic alignment. iii) Finally, the fused language tokens serve as guiding signals to extract potential target tokens from the search frame and propagate them to subsequent frames, enhancing temporal prompts and encouraging the tracker to autonomously learn instance tracking from unlabeled videos. This new modeling approach enables the effective self-supervised learning of language-guided tracking representations without the need for large-scale bounding box annotations. Extensive experiments on VL tracking benchmarks show that {\tracker} surpasses SOTA self-supervised methods.
comment: CVPR 2026
☆ Do Joint Audio-Video Generation Models Understand Physics?
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
comment: Preprint. Full abstract appears in the PDF
☆ Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
Foundation models (FMs) have shown great promise in medical imaging, but most FMs are trained on unimodal data within isolated domains, such as brain MRI alone. Human aging and disease arise through coordinated biological processes across organs, therefore motivating multimodal FMs that learn whole-body representations. A key challenge, however, is that real-world multimodal biomedical data are often missing not at random, which can reduce power, limit generalizability, and introduce bias. We propose Pan-FM, a pan-organ foundation model pre-trained on imaging from seven organs (Brain, Heart, Adipose, Liver, Kidney, Spleen, and Pancreas) under realistic missing-organ scenarios. Pan-FM uses a unified backbone that handles organ missingness during both training and inference, and is pre-trained with masking-based self-distillation. We find that naive multimodal pre-training leads to dominant-organ shortcut learning bias, with the model over-relying on dominant organs such as adipose and heart. To address this, we introduce Saliency-Guided Masking (SGM), which uses the model attention distribution to adaptively mask dominant organs during pre-training, thus encouraging more balanced cross-organ, whole-body learning. Notably, SGM introduces negligible computational overhead and can be seamlessly integrated into existing self-supervised learning frameworks to improve multi-organ representation learning. On the UK Biobank, Pan-FM achieves stronger prediction across 13 disease categories and 14 single disease entities than single-organ and multi-organ baselines, with improved robustness under missing-organ settings. Pan-FM serves as a scalable solution to realistic modality-missingness in multimodal learning in system neuroscience and as a step toward more generalizable whole-body FMs.
♻ ☆ Surgical Visual Understanding (SurgVU) Dataset
Owing to recent advances in machine learning and the ability to harvest large amounts of data during robotic-assisted surgeries, surgical data science is ripe for foundational work. We present a large dataset of surgical videos and their accompanying labels for this purpose. We describe how the data was collected and some of its unique attributes. Multiple example problems are outlined. Although the dataset was curated for a particular set of scientific challenges (in an accompanying paper), it is general enough to be used for a broad range machine learning questions. Our hope is that this dataset exposes the larger machine learning community to the challenging problems within surgical data science, and becomes a touch-stone for future research. The videos are available at https://storage.googleapis.com/isi-surgvu/surgvu24_videos_only.zip, the labels at https://storage.googleapis.com/isi-surgvu/surgvu24_labels_updated_v2.zip, a validation set for tool detection problem at https://storage.googleapis.com/isi-surgvu/cat1_test_set_public.zip, and a sample set of question & answer pairs dataset for surgical visual question answering at https://storage.googleapis.com/isi-surgvu/SURGVU25_cat_2_sample_set_public.zip.
♻ ☆ Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts ICML 2026
Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable {C}ontinual Le{a}rner with efficient Bi-Level {R}outing Mixture-of-{E}xperts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging dataset, OmniBenchmark-1K, for CIL performance evaluation on very long task sequences with hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. We hope that this work will inspire further research into continual learning over extremely long task sequences. Code and dataset are publicly released at https://github.com/LMMMEng/CaRE.
comment: Accepted by ICML 2026
♻ ☆ Multispectral Indices for Wildfire Management
The increasing frequency and severity of wildfires necessitates advanced methods for effective surveillance and management, as traditional ground-based techniques often struggle to adapt to rapidly changing fire behavior and environmental conditions. This study investigates the use of multispectral aerial and satellite imagery for wildfire management through an assessment of current literature and two practical case studies. We evaluate several multispectral indices for their ability to extract environmental features critical for analyzing wildfire behavior, including vegetation, water bodies, and artificial structures. Our results highlight NVDI for vegetation, MNDWI for water features, and MSR for artificial structures as particularly effective for segmentation and feature extraction. The application of these indices enhances wildfire data processing and supports improved monitoring, risk assessment, and response strategies, demonstrating the potential of multispectral imagery to complement traditional wildfire monitoring and management approaches.
comment: The peer-reviewed version of this paper is published in Frontiers in Remote Sensing at https://doi.org/10.3389/frsen.2026.1807451. This version is typeset by the authors and differs only in pagination and typographical detail
♻ ☆ Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
♻ ☆ SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes
Modern T2V/I2V generators synthesize people increasingly hard to distinguish from authentic footage, while current evaluation suites lag: legacy benchmarks target manipulation-based forgeries, and recent synthetic-video benchmarks prioritize scale over realistic human depiction. We introduce SynthForensics, a people-centric benchmark of $20{,}445$ videos from 8 T2V and 7 I2V open-source generators, paired-source from FF++/DFD reals, two-stage human-validated, in four compression versions with full metadata. In our paired-comparison human study, raters prefer SynthForensics in $71$--$77\%$ of head-to-head comparisons against each of nine existing synthetic-video benchmarks, while facial-quality metrics fall within the FF++/DFD baseline range. Across 15 detectors and three protocols, face-based methods drop $13$--$55$ AUC points (mean $27$) from FF++ to SynthForensics and a further $23$ under aggressive compression; fine-tuning closes the gap at a backward cost on legacy benchmarks; training from scratch shows synthetic and manipulation features largely disjoint for most detectors. We release dataset, pipeline, and code.
♻ ☆ Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification
We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity--specificity trade-offs across feature-selection scenarios.
comment: 22 pages, 9 figures, 12 table, in preparation for journal submission
♻ ☆ Frequency-Aware Model Parameter Explorer: A new attribution method for improving explainability
State-of-the-art attribution methods rely on adversarial sample generation that applies an all-pass filter across the frequency spectrum, discarding fine-grained high-frequency information that is demonstrably important for accurate feature attribution in deep neural networks. By generating adversarial samples that selectively perturb high- and low-frequency components, we can probe which spectral features a model relies on most -- directly translating frequency-domain exploration into attribution signals. Building on this insight, we propose FAMPE (Frequency-Aware Model Parameter Explorer), a novel attribution method that introduces an FFT-based α-weighted perturbation scheme -- separately modulating high- and low-frequency components via an energy-driven spectral cutoff -- and, crucially, integrates this frequency-aware exploration directly into model parameter exploration for attribution, a connection that has not been established in prior work. Unlike prior frequency-aware adversarial approaches that target transferability or imperceptibility, FAMPE's specific formulation is designed and validated exclusively for explainability, translating spectral structure into fine-grained attribution maps without requiring any manual baseline selection. Evaluated on ImageNet across four architectures spanning CNNs and Vision Transformers, at fixed α= 0.1 FAMPE outperforms AttEXplore by 4.25% on Inception-v3 and 12.04% on MaxViT-T, with per-sample oracle selection further revealing that low-frequency-dominated images systematically benefit from high-frequency perturbations -- underscoring the potential of adaptive spectral exploration. Our ablation studies confirm that high-frequency perturbations are disproportionately responsible for attribution precision, while excessive low-frequency noise degrades global structural coherence.
comment: Preprint
♻ ☆ Drifting Fields are not Conservative
Drifting models have recently gained attention for generating high-quality samples in a single forward pass. During training, they learn a push-forward map by following a vector-valued field, the drift field. We ask whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative and cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism, with the Gaussian kernel as the unique radial exception. Guided by this, we introduce the sharp kernel $k^\#$ and a sharp-normalized drift field that is conservative for general radial kernels. The resulting vector field is the gradient of a scalar potential that can be optimized directly using stochastic gradient descent. Moreover, the field has the form of a score difference of kernel density estimates, and gives exact equilibrium identifiability. Thus, sharp normalization closes the gap to related literature, such as Wasserstein gradient-flows and denoising score matching, also for non-Gaussian kernels. Empirically, sharp normalization preserves the performance of the original drifting objective, suggesting that the non-conservative flexibility is not required for high-quality generation.
♻ ☆ ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than fine-grained object dynamics. As a step toward closing this gap, we further provide an object-centric supervised fine-tuning baseline with pseudo object-level supervision and spatial-temporal constraints. Models fine-tuned on ProcObject-10K not only improve on the benchmark itself, but also transfer effectively to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and evaluation toolkit will be publicly released to support future research on object-centric procedural understanding.
♻ ☆ Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models~(MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
♻ ☆ Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
♻ ☆ Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation MICCAI 2026
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.438 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
comment: 10 pages, 5 figures. Submitted to MICCAI 2026
♻ ☆ Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models
Deep learning models for chest X-ray diagnosis are constrained by limited coverage of clinically meaningful concept combinations in publicly available training datasets. While synthetic image generation has been explored to increase data diversity, existing methods rarely enforce clinical or anatomical constraints, limiting utility for improving model reliability. We propose CARPA, a clinically aware and anatomically grounded framework for synthetic chest X-ray generation that applies targeted perturbations to clinical concept vectors while preserving anatomical structure. By producing anatomically faithful synthetic images with controlled concept insertions and deletions, CARPA expands clinically relevant concept coverage. We evaluate CARPA across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior concept perturbation approaches, fine-tuning on CARPA-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong concept alignment, and low semantic uncertainty. Evaluation by two expert radiologists further confirms realism and clinical agreement. Together, these results show that anatomically grounded concept perturbations enable more effective use of synthetic data, improving both performance and reliability of chest X-ray classification models and supporting safer clinical deployment.
♻ ☆ VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling
Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.
♻ ☆ DeepFedNAS: Efficient Hardware-Aware Architecture Adaptation for Heterogeneous IoT Federations via Pareto-Guided Supernet Training
Deploying federated learning across heterogeneous IoT device fleets requires tailored neural network architectures for each device class, yet existing Federated Neural Architecture Search (FedNAS) methods suffer from unguided supernet training and prohibitively costly post-training search pipelines that demand over 20 GPU-hours per deployment target. We introduce DeepFedNAS, a two-phase framework built on a multi-objective fitness function that synthesizes information-theoretic network metrics with architectural heuristics. In the first phase, Federated Pareto Optimal Supernet Training replaces random subnet sampling with a pre-computed cache of elite, high-fitness architectures, yielding a superior supernet. In the second phase, a Predictor-Free Search uses this fitness function as a zero-cost accuracy proxy, discovering hardware-optimized subnets in ~20 seconds, a ~61x speedup over the baseline pipeline. Experiments on CIFAR-10, CIFAR-100, and CINIC-10 demonstrate state-of-the-art accuracy (up to +1.21% on CIFAR-100), a 2.8x reduction in per-round transmission size, and robust performance under extreme non-IID conditions (α = 0.1), making DeepFedNAS practical for scalable, communication-constrained IoT federations. Source code: https://github.com/bostankhan6/DeepFedNAS
comment: This paper significantly extends the preliminary work presented at ESANN 2026. Source Code: https://github.com/bostankhan6/DeepFedNAS
♻ ☆ NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection
The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4\% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
♻ ☆ VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
♻ ☆ 3D Generation for Embodied AI and Robotic Simulation: A Survey
Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This survey reviews 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. In Data Generator, 3D generation produces simulation-ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; in Simulation Environments, it constructs interactive and task-oriented worlds, spanning structure-aware, controllable, and agentic scene generation; and in Sim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real-world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks, including limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim-to-real divide, that must be addressed for 3D generation to become a dependable foundation for embodied intelligence. Our project page is at https://3dgen4robot.github.io.
comment: 27 pages, 11 figures, 8 tables
♻ ☆ Frozen Backpropagation: Relaxing Weight Symmetry in Deep Spiking Neural Networks
Direct training of Spiking Neural Networks (SNNs) on neuromorphic hardware can greatly reduce energy costs compared to GPU-based training. However, implementing Backpropagation (BP) on such hardware is challenging because forward and backward passes are typically performed by separate networks with distinct weights. To compute correct gradients, forward and feedback weights must remain symmetric during training, necessitating weight transport between the two networks. This symmetry requirement imposes hardware overhead and increases energy costs. To address this issue, we introduce Frozen Backpropagation (\textsc{fBP}), a BP-based training algorithm relaxing weight symmetry in settings with separate networks. fBP updates forward weights by computing gradients with periodically frozen feedback weights, reducing weight transports during training and minimizing synchronization overhead. To further improve transport efficiency, we propose three partial weight transport schemes of varying computational complexity, where only a subset of weights is transported at a time. We evaluate our methods on image recognition tasks using both temporally and rate-coded SNNs, and compare them to existing approaches addressing the weight symmetry requirement. Our results show that fBP outperforms these methods and achieves accuracy comparable to BP while significantly lowering transport costs. With partial weight transport, fBP can further lower those costs by up to 10,000x at the expense of moderate accuracy loss. This work provides insights for guiding the design of neuromorphic hardware incorporating BP-based on-chip learning.
♻ ☆ Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
comment: Work in progress
♻ ☆ VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token level supervision. To stably integrate dense supervision with RL, we introduce a direction magnitude decoupling mechanism, where rollout level advantages computed from rewards determine update direction, while structured privileged signals modulate token level update magnitudes. This design enables semantically aligned and fine grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs.
♻ ☆ GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection
Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.
♻ ☆ GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval
The CLIP model has established itself as a cornerstone of large-scale retrieval systems. However, its performance often degrades under distributional shifts such as multilingual, long-form, or multimodal queries. To avoid the prohibitive costs associated with retriever retraining or corpus re-embedding, we propose GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play approach that leverages LLM-based query rewriting to bridge these gaps. Unlike existing methods that lack explicit supervision, GRAPE integrates ranking signals into the rewriting LLM via Grouped Relative Policy Optimization (GRPO), ensuring rewritten queries are better aligned with the frozen retriever's latent distribution. Crucially, we identify a score inflation phenomenon in naive similarity-based finetuning - where irrelevant candidates receive indiscriminately high scores - and mitigate it with a novel corpus-relative ranking-based reward. Extensive experiments across multilingual (Flickr30k-CN, CVLUE, XM3600), long-form (Wikipedia), and multimodal (CIRR) benchmarks demonstrate that GRAPE consistently improves performance, achieving an average gain of 4.9% in Recall@10 without any modification to the underlying retriever. The code is available at https://github.com/mogulzhang/GRAPE.
♻ ☆ The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs
This paper investigates the relationship between convolutional neural network (CNN) topology and image recognition performance through a comparative study of the VGG, ResNet, and GoogLeNet architectural families. Utilizing a unified experimental framework, the study isolates the impact of depth from confounding implementation variables. A formal distinction is introduced between nominal depth ($D_{\mathrm{nom}}$), representing the physical layer count, and effective depth ($D_{\mathrm{eff}}$), an operational metric quantifying the expected number of sequential transformations. Empirical results demonstrate that architectures utilizing identity shortcuts or branching modules maintain optimization stability by decoupling $D_{\mathrm{eff}}$ from $D_{\mathrm{nom}}$. These findings suggest that effective depth serves as a superior framework for predicting scaling potential and practical trainability, ultimately indicating that architectural topology - rather than sheer layer volume - is the primary determinant of gradient health in deep learning models.
♻ ☆ Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response
Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6\% accuracy within 50\,km, with the strongest performance on floods (50\% within 50\,km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called \ECHO{}, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
♻ ☆ Structure Over Scale: Learning Visual Reasoning from Pedagogical Video
State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune Qwen2-VL and Qwen3-VL using Group Relative Policy Optimization (GRPO) to leverage the clear correctness signals and structured reasoning traces inherent in educational content. Despite training on just 10K QA pairs from 78 hours of children's television, orders of magnitude less data than GPT and Gemini, our approach delivers generalizable performance gains for Qwen-based VLMs, yielding consistent improvements on NExT-QA (+19.7), Video-MME (+10.6), and MotionBench (+4.9), matching the performance of leading proprietary systems and demonstrating that content structure can compensate for content scale.
♻ ☆ Channel-Level Relation to Attentive Aggregation with Neighborhood-Homogeneity Constraint for Point Cloud Analysis
In 3D point cloud understanding, the core challenge lies in accurately capturing discriminative features within complex neighborhoods, which directly affects the execution precision of downstream tasks such as embodied AI and autonomous driving. Existing methods explore feature correlation discrimination but are limited to point-level spatial distribution or channel responses, enabling only coarse-grained level evaluation. For modern multi-scale point cloud networks, such coarse-grained metrics inevitably incur significant information loss in deeper layers. To address this, we propose PointCRA, a novel network with a channel-level metric-based enhancement mechanism. Our core idea is to introduce temporal trend variation as a new evaluation dimension to avoid the information loss caused by weight dimension collapse in existing spatial and channel attention mechanisms. On this basis, we construct a multi-level calibration framework guided by neighborhood homogeneity for weight calibration, and design a dedicated loss function to enhance channel discriminability.PointCRA leverages intrinsic feature priors to adaptively correct feature aggregation, offering interpretability with low parameter overhead. Our method is transferable, interpretable, and efficient. We validate the proposed method on diverse datasets and benchmark models, and further demonstrate its rationality through extensive analytical experiments. Our PointCRA achieves 77.5\% mIoU on the S3DIS dataset, 90.4\% OA on the ScanObjectNN dataset, and 87.4\% instance mIoU on the ShapeNetPart dataset. The code and pretrained weights are publicly available on GitHub: https://github.com/AGENT9717/PointCRA
♻ ☆ ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation CVPR 24
Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don't explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable "Reference" prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, i.e., the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.
comment: Extended version of our CVPR 24 paper, accepted by IJCV 2025
♻ ☆ LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
comment: 8 pages, 7 figures, conference
♻ ☆ PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection ICML 2026
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
comment: Accepted at ICML 2026
♻ ☆ SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://github.com/St0pien/SoftSAE.
♻ ☆ Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
comment: Project page: https://apple.github.io/ml-headsup/
♻ ☆ MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices
Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters), and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling, and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR, and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. On-device deployment on the Raspberry Pi Pico 2 and ESP32 validates hardware viability: MicroBi-ConvLSTM is the only architecture achieving full 8/8 dataset coverage on both platforms, with 72.8 ms average latency on Pico 2 and 97.9% PyTorch parity on ESP32, while all three baselines show partial or complete deployment failure.
♻ ☆ Tables Guide Vision: Learning to See the Heart through Tabular Data
Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. Code is available at https://github.com/marteczkah/tables_guide_vision.
♻ ☆ Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
♻ ☆ RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization ICLR 2026
Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, arbitrary canonical representation. We introduce RECON, a class-pose agnostic canonical orientation normalization that corrects arbitrary canonicals via a simple right translation, yielding natural, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play test-time canonicalization layer. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We validate on images and molecular ensembles, demonstrating accurate symmetry discovery, and matching or outperforming other canonicalizations in downstream classification.
comment: Accepted as a conference paper at ICLR 2026
♻ ☆ Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation
Long-duration talking video synthesis faces enduring challenges in achieving high video quality, portrait consistency, temporal coherence, and computational efficiency. As video length increases, issues such as visual degradation, portrait drift, temporal artifacts, and error accumulation become increasingly problematic, severely affecting the realism and reliability of the results. To address these challenges, we present LetsTalk, a diffusion transformer framework equipped with multimodal guidance and a novel memory bank mechanism, explicitly maintaining contextual continuity and enabling robust, high-quality, and efficient generation of long-duration talking videos. In particular, LetsTalk introduces a noise-regularized memory bank to alleviate error accumulation and sampling artifacts during extended video generation. To further improve efficiency and spatiotemporal modeling, LetsTalk employs a deep compression autoencoder and a spatiotemporal-aware transformer with linear attention for effective multimodal fusion. We systematically analyze three fusion schemes and show that combining deep (Symbiotic Fusion) for portrait features and shallow (Direct Fusion) for audio achieves superior visual realism and precise speech-driven motion, while preserving diversity of movements. Extensive experiments demonstrate that LetsTalk establishes new state-of-the-art in generation quality, producing temporally coherent and realistic talking videos with enhanced diversity and liveliness, and maintains remarkable efficiency with 8x fewer parameters than previous approaches.
comment: 16 pages, 25 figures
♻ ☆ 3DSS: 3D Surface Splatting for Inverse Rendering
We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a direct formulation in terms of the reconstruction kernels themselves. From this foundation we derive a coverage-based compositing model whose per-layer opacity arises directly from the accumulated Elliptical Weighted Average reconstruction weight, yielding anti-aliased silhouettes and informative visibility gradients at sparsely covered edges. Combined with forward microfacet shading under co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers shape, spatially-varying BRDF materials, and illumination. Because the optimized representation is a set of oriented surface samples, it bridges natively to mesh-based workflows via surface reconstruction from oriented point cloud methods. We evaluate 3DSS against mesh-based, implicit, and Gaussian-splatting baselines across geometry reconstruction, novel-view synthesis, and novel-illumination relighting.
♻ ☆ Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
comment: 35 pages, 30 figures, 8 tables
♻ ☆ Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.
♻ ☆ Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence, decide whether that evidence is reliable enough to commit, and do so under bounded inference cost without access to labels. We propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. An affordance-specific Verifier then gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets show a stronger accuracy-cost Pareto frontier than fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency. Project page: https://tenplusgood.github.io/a-harness-page/.
comment: 43 pages, 22 figures, 8 tables. Ongoing work
♻ ☆ CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition ICLR 2026
Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
comment: Accepted as poster at ICLR 2026
♻ ☆ CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
♻ ☆ Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
We propose MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL-Rad addresses the limitation of post-hoc agentization, where fixed LLMs are organized into hand-designed agentic workflows without being optimized for their assigned roles. Our framework decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, and jointly optimizes them using clinically verifiable rewards. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art clinical efficacy performance. Further analyses show that MARL-Rad improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that MARL-Rad produces reports clinically comparable to ground-truth reports.
comment: 23 pages, 4 figures
♻ ☆ Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
♻ ☆ Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning
Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model's parameters frozen. It enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer's representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.
comment: We have reconsidered the issue, and an updated version will be released later
♻ ☆ Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
comment: 9 Pages , 6 figures, Neurips 2026
♻ ☆ RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
Large Vision-Language Models (VLMs) are increasingly deployed in open-ended environments, where ensuring reliable safety under multimodal inputs is critical. However, existing evaluations remain largely instruction-centric, focusing on explicit malicious queries while overlooking a more realistic and underexplored risk: whether safety alignment remains robust under harmful contextual exposure. This limitation is particularly important for multimodal systems, where visual inputs can substantially steer model behavior and render text-only auditing insufficient. In this work, we study multimodal safety auditing under harmful contextual exposure, asking whether VLMs can maintain safe behavior when partial toxic text is paired with visual context. To enable systematic auditing, we propose RedDiffuser (RedDiff), a reinforcement-based framework that leverages diffusion models to generate semantically coherent visual inputs for black-box safety testing. By combining greedy prompt search with reinforcement optimization, RedDiffuser uncovers high-risk multimodal inputs that expose latent safety failures. Extensive experiments on both open-source and commercial VLMs show that such context-conditioned failures are widespread. On LLaVA, RedDiffuser increases unsafe response rates by up to 10.69% on the original set and 8.91% on a hold-out set, with strong transferability to Gemini and LLaMA-Vision. These vulnerabilities persist even under external safety guardrails, suggesting that current system-level safety mechanisms remain insufficient for realistic multimodal risks. Our findings reveal a critical blind spot in existing safety evaluations and establish context-aware multimodal auditing as an essential paradigm for diagnosing hidden vulnerabilities in modern VLM systems.
♻ ☆ Zero-Shot Quantization via Weight-Space Arithmetic
We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
♻ ☆ Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation MICCAI 2026
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
comment: MICCAI 2026
♻ ☆ Large Video Planner Enables Generalizable Robot Control
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
comment: 29 pages, 16 figures
♻ ☆ Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise ICML 2026
Inspired by the idea of Positive-incentive Noise (Pi-Noise or $π$-Noise) that aims at learning the reliable noise beneficial to tasks, we scientifically investigate the connection between contrastive learning and $π$-noise in this paper. By converting the contrastive loss to an auxiliary Gaussian distribution to quantitatively measure the difficulty of the specific contrastive model under the information theory framework, we properly define the task entropy, the core concept of $π$-noise, of contrastive learning. It is further proved that the predefined data augmentation in the standard contrastive learning paradigm can be regarded as a kind of point estimation of $π$-noise. Inspired by the theoretical study, a framework that develops a $π$-noise generator to learn the beneficial noise (instead of estimation) as data augmentations for contrast is proposed. The designed framework can be applied to diverse types of data and is also completely compatible with the existing contrastive models. From the visualization, we surprisingly find that the proposed method successfully learns effective augmentations. Our code is available at https://github.com/hyzhang98/PiNDA.
comment: Accepted by ICML 2026
♻ ☆ S2M-Net: Spectral-Spatial Mixing for Medical Image Segmentation with Morphology-Aware Adaptive Loss
Medical image segmentation requires balancing local precision for boundary-critical clinical applications, global context for anatomical coherence, and computational efficiency for deployment on limited data and hardware a trilemma that existing architectures fail to resolve. Although convolutional networks provide local precision at $\mathcal{O}(n)$ cost but limited receptive fields, vision transformers achieve global context through $\mathcal{O}(n^2)$ self-attention at prohibitive computational expense, causing overfitting on small clinical datasets. We propose S2M-Net, a 4.7M-parameter architecture that achieves $\mathcal{O}(HW \log HW)$ global context through two synergistic innovations: (i) Spectral-Selective Token Mixer (SSTM), which exploits the spectral concentration of medical images via truncated 2D FFT with learnable frequency filtering and content-gated spatial projection, avoiding quadratic attention cost while maintaining global receptive fields; and (ii) Morphology-Aware Adaptive Segmentation Loss (MASL), which automatically analyzes structure characteristics (compactness, tubularity, irregularity, scale) to modulate five complementary loss components through constrained learnable weights, eliminating manual per-dataset tuning. Comprehensive evaluation in 16 medical imaging datasets that span 8 modalities demonstrates state-of-the-art performance: 96.12\% Dice on polyp segmentation, 83.77\% on surgical instruments (+17.85\% over the prior art) and 80.90\% on brain tumors, with consistent 3-18\% improvements over specialized baselines while using 3.5--6$\times$ fewer parameters than transformer-based methods.
comment: I would like to withdraw the paper from arXiv because the current version contains issues that need to be carefully revised before public dissemination
♻ ☆ Super-Resolution of Airborne Laser Scanning Point Clouds for Forest Inventory
Airborne Laser Scanning (ALS) can collect point clouds across large areas, enabling large-scale forest inventory. However, ALS point clouds are sparse and noisy, resulting in inaccurate individual-tree-level forest inventory, such as stem localization and tree size estimation. To overcome this problem, we propose a deep learning model, 3D Forest Super Resolution (3DFSR), to simultaneously improve point density and reduce noise for ALS forest point cloud. 3DFSR is a voxel-based CNN with a U-Net architecture. The proposed 3DFSR is evaluated on ALS point clouds collected in both temperate forests in the U.S. and boreal forests in Germany. Experimental results demonstrate that 3DFSR can generate finer point clouds of tree structure than other state-of-the-art point cloud super-resolution algorithms, achieving 0.249 m Chamfer Distance and 2.711 m Hausdorff Distance. Furthermore, to verify the effectiveness of 3DFSR point clouds in forest inventory, we conduct stem detection, DBH measurements, and stem reconstruction on both original ALS point clouds and 3DFSR enhanced point clouds. We find that stem detection and reconstruction algorithms developed for TLS/MLS point clouds can directly work on our 3DFSR point clouds, and DBH can be derived with circle-fitting method. F1 score of stem detection is improved from 0.71 on original ALS point clouds to 0.97 on 3DFSR point clouds; DBH estimation improves from 13.45 cm RMSE using allometric equations to 6.43 cm using circle fitting; comparing to stems reconstruction from MLS point clouds, stem reconstructed from 3DFSR point clouds has 0.170 m of Chamfer Distance and 0.377 m of Hausdorff Distance, and 0.95 R2 volume estimation. Finally, we find that the proposed 3DFSR is applicable to process point densities from 10 to 1700 points/m2; it also can be generalized across data collected from different LiDAR platforms without transfer learning.
♻ ☆ Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning
Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD$^2$), maintains two parallel probability distribution of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrastive gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD$^2$ on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering on multiple datasets such as, POPE, MME, VQAv2, AMBER, and MSCOCO. IECD$^2$ demonstrates consistent improvements in task accuracy and reasoning performance with substantial reduction in hallucination compared to state-of-the-art decoding approaches.
♻ ☆ AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation SIGGRAPH 2026
Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior arts frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.
comment: 16 pages, SIGGRAPH 2026
♻ ☆ CalexNet: Soft Cascade-Aligned Training and Calibration for Lightweight Early-Exit Branches
Early-exit cascades over a frozen convolutional backbone enable adaptive inference but suffer from three sources of train-inference mismatch: branches train on samples they will never see at inference, their per-class precision thresholds are calibrated on the wrong distribution, and the standard cross-entropy target on backbone argmax labels discards the backbone's uncertainty signal. We close all three gaps with CalexNet (Cascade-Aligned Early eXits), a training-recipe-only modification: branches train under continuously-weighted importance sampling that matches the cascade-survivor distribution; per-class precision thresholds are calibrated on the actual cascade-survivor subset of the validation set; and the classification head is trained against the backbone's full softmax via a temperature-scaled KL objective. Combined with an augmented prototype-pooling branch head, CalexNet is evaluated on ResNet18 and ResNet50 backbones across CIFAR-100 (20-superclass coarse, the harder primary setting) and CINIC-10 (10-class, the easier cross-validation counterpart). On the accuracy-FLOPs Pareto frontier, CalexNet matches or exceeds three published baselines (PTEEnet, ZTW, BoostNet) and a within-paper "no-alignment, no-KD" reference. The largest gains appear in the practically relevant 30-70% FLOPs-reduction regime and are stable across n=3 training seeds. CalexNet requires no inference-time architectural change and is a drop-in for any frozen-backbone early-exit cascade.
comment: 19 pages, 6 figures
♻ ☆ Automatic Image-Level Morphological Trait Annotation for Organismal Images ICLR 2026
Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
comment: ICLR 2026
♻ ☆ DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization ICDAR
Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which involves the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) Kuzushiji character and seal detection and (2) document binarization. For the Kuzushiji character and seal detection track, we provide baseline results using several recent versions of YOLO to detect Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, two state-of-the-art (SOTA) generative adversarial network (GAN) methods, and our improved conditional GAN (cGAN)-based method. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.
comment: IJDAR 2026 (ICDAR-IJDAR Track)
♻ ☆ EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation
Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
comment: added ack and github link
♻ ☆ Saving Foundation Flow-Matching Priors for Inverse Problems
Foundation flow-matching (FM) models promise a universal prior for solving inverse problems (IPs), yet today they trail behind domain-specific or even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with a sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. This leads to a significant performance boost across image restoration and scientific IPs. Our results point to a path for making foundation FM models practical, reusable priors for IP solving.
♻ ☆ Benchmarking and Improving GUI Agents in High-Dynamic Environments
Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.
♻ ☆ Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering ICML 2026
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, motivating sparse attention techniques for improving efficiency. However, existing training-free sparse attention methods for video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight: attention sparsity is an intrinsic layer-wise property, with only minor variation across different inputs. Motivated by this observation, we propose SVOO, a training-free sparse attention framework for fast video generation via offline layer-wise sparsity profiling and online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93x speedup while maintaining a PSNR of up to 29 dB on Wan2.1. Code is available at: https://github.com/Mutual-Luo/SVOO.
comment: Accepted by ICML 2026
♻ ☆ InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning
Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.
♻ ☆ LoopNav: Benchmarking Spatial Consistency in World Models
The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. It must not only retain long-horizon observational information, but also enables the construction of explicit or implicit internal spatial representations. However, existing datasets do not explicitly enforce spatial consistency constraints, limiting both the ability to systematically evaluate this capability and to learn it through data-driven approaches. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we propose LoopNav, a dataset and corresponding benchmark centered on loop-based navigation for evaluating spatial consistency. The dataset comprises 250 hours (20 million frames) of loop-based navigation videos with actions, collected from diverse locations in the open-world environment of Minecraft. We further introduce a Scene Graph Consistency Score to quantify spatial consistency while remaining invariant to pixel-level variations. Dataset, benchmark, and code are open-sourced to support future research.
comment: V3: SGCS
♻ ☆ A Unified and Controllable Framework for Layered Image Generation with Visual Effects
Recent image generation models produce impressive composites, but often fail to preserve the identity of user-provided content when editing specific elements: the surrounding scene may shift, and even the edited object's appearance can drift from the original. Layered representation offer a natural remedy--they allow users to independently manipulate individual elements--but existing layered methods typically produce transparent foregrounds without realistic visual effects such as shadows and reflections, forcing the use of a second harmonization model after every edit, which in turn introduces drift. To overcome these limitations, we present LASAGNA, which generates a photorealistic background (BG) and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground (FG) layer, LASAGNA supports the dominant class of consumer edits (e.g., translation, scaling, recoloring, duplication) via alpha compositing alone, without invoking any model post-edit, thereby eliminating identity drift introduced by cascade editing pipelines. This single-pass design contrasts with prior layered methods that rely on separate expert models for each task. LASAGNA handles diverse conditional inputs--text prompts, FG, BG, and location masks--within a unified architecture. We further release two community resources: LASAGNA-48K, the first public dataset of 48K layered image triplets with photorealistic visual effects, and LASAGNA-BENCH, the first standardized benchmark for layer-centric generation and editing, comprising 242 expert-annotated samples across six diverse sources. Experiments show that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes, and supports a wide range of post-edits without any model re-inference.
♻ ☆ Multimodal Latent Reasoning via Hierarchical Visual Cues Injection
The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (\emph{HIVE}), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
♻ ☆ A Step to Decouple Optimization in 3DGS ICLR 2026
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, optimization widely accepted in deep neural networks (DNNs) is actually adopted in 3DGS, such as synchronous weight updating and Adam with the adaptive gradient. However, considering the physical significance and specific design in 3DGS, there are two overlooked details in the optimization of 3DGS: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such a complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Taking a large number of experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.
comment: Accepted by ICLR 2026 (fixed typo)
♻ ☆ Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2 to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in major neutrino experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events, and also a Vision Transformer (ViT-h/14), which is the same architecture inside the VLM's vision encoder. Our evaluation considers both classification performance and interpretability of the model predictions, comparing a VLM with a vision-only transformer (ViT) and a convolutional neural network (CNN) baseline. We find that transformer-based architectures outperform conventional CNNs in classification accuracy and robustness, with the VLM providing additional flexibility through the integration of auxiliary textual or semantic information and enabling more interpretable, reasoning-based predictions. These results highlight the potential of large transformer models, particularly vision-language models, as general-purpose backbones for physics event classification, combining strong performance, robustness, and interpretability, and opening new avenues for multimodal reasoning in experimental neutrino physics.
comment: Accepted for publication in Communications Physics (Nature Portfolio)
♻ ☆ SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis
Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.
♻ ☆ DisCo-FLoc: Semantic-Free Floorplan Localization via $SE(2)$-Aware Contrastive Disambiguation
Visual Floorplan Localization (FLoc) struggles with severe structural aliasing caused by repetitive minimalist layouts. This occurs because physically distant poses share highly similar visual-geometric features, which degrades spatial separability and angular discriminability. While existing methods attempt to mitigate these ambiguities by relying on costly semantic annotations, the resulting performance gains remain inherently limited. To address the above issues, we propose DisCo-FLoc, a semantic-free method for visual-geometric Contrastive Disambiguation. First, we introduce a depth-aware Ray Regression Predictor (RRP) that serves as a dense-to-ray geometric projector. By explicitly suppressing visual clutter along the vertical dimension, RRP projects monocular RGB images into 2D ray primitives, which are matched with floorplans to produce geometry-aware FLoc candidates. Second, to resolve the remaining ambiguity among these candidates, we propose a spatially perturbed contrastive objective to align RGB images with local floorplan structures and formulate a visual-geometric compatibility function. In particular, we meticulously construct positive and negative samples at both positional and directional levels through $SE(2)$ pose perturbations for contrastive learning, effectively achieving pose smoothness, spatial separability, and angular discriminability. The compatibility function enables DisCo-FLoc to disambiguate FLoc by using richer visual context beyond pure geometric layouts, without requiring any semantic annotations. Extensive experiments on two challenging visual FLoc benchmarks demonstrate that DisCo-FLoc significantly outperforms state-of-the-art semantic-based methods, especially narrowing the performance gap between positional and directional FLoc accuracy.
comment: 9 pages, 3 figures
♻ ☆ NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
comment: 10 pages, 7 figures
♻ ☆ Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution
Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.
♻ ☆ Towards Explainable Industrial Anomaly Detection via Knowledge-Guided Latent Reasoning
Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods across multiple tasks. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.
♻ ☆ SRGAN-CKAN: Expressive Super-Resolution with Nonlinear Functional Operators under Minimal Resources
Single-Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a Low-Resolution (LR) observation, a fundamentally ill-posed problem where high-frequency details are severely degraded at large upscaling factors. Recent advances have been driven by transformer-based architectures and diffusion models improve global context modeling and perceptual quality at the cost of increased computational complexity. In contrast, this work focuses on enhancing the expressivity of local operators under minimal resources. We propose SRGAN--CKAN, a hybrid super-resolution framework that integrates Convolutional Kolmogorov--Arnold Networks (CKAN) into an adversarial learning setting reformulating convolution as a nonlinear patch-based transformation. The proposed operator replaces linear local mappings with spline-based functional representations, allowing expressive modeling of complex local structures and high-frequency textures using minimal hardware resources. Experimental results demonstrate that the proposed approach improves perceptual quality while preserving reconstruction fidelity, achieving a favorable balance between distortion-based and perceptual metrics. These results are obtained under constrained computational settings, highlighting the efficiency of the proposed formulation. Overall, this work introduces a complementary direction to existing approaches by improving the representational power of local transformations, providing an efficient and scalable alternative to globally intensive architectures.
♻ ☆ AdaCorrection: Adaptive Offset Cache Correction for Accurate Diffusion Transformers
Diffusion Transformers (DiTs) achieve state-of-the-art performance in high-fidelity image and video generation but suffer from expensive inference due to their iterative denoising structure. While prior methods accelerate sampling by caching intermediate features, they rely on static reuse schedules or coarse-grained heuristics, which often lead to temporal drift and cache misalignment that significantly degrade generation quality. We introduce \textbf{AdaCorrection}, an adaptive offset cache correction framework that maintains high generation fidelity while enabling efficient cache reuse across Transformer layers during diffusion inference. At each timestep, AdaCorrection estimates cache validity with lightweight spatio-temporal signals and adaptively blends cached and fresh activations. This correction is computed on-the-fly without additional supervision or retraining. Our approach achieves strong generation quality with minimal computational overhead, maintaining near-original FID while providing moderate acceleration. Experiments on image and video diffusion benchmarks show that AdaCorrection consistently improves generation performance.
♻ ☆ Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.
Artificial Intelligence 300
☆ EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., $O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
☆ VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection ACL 2026
A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
comment: Accepted to Findings of ACL 2026
☆ Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
☆ Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves $71.7\%$ normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
☆ The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.
☆ CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
☆ Fast Byte Latent Transformer
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
☆ SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.
☆ Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett--Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
☆ MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis
Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult/uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous readers behavior, workload imbalance, asymmetric diagnostic harm, case difficulty from morphology and deployment shift. We introduce MPD$^2$-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human--AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel--sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD$^2$-Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1--MCC--cost, robust under cross-domain shift, and yields balanced expert utilization.
☆ Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction
Spiking Neural Networks (SNNs) have been proposed as biologically plausible and energy-efficient alternatives to conventional Artificial Neural Networks (ANNs). However, the training of SNN usually relies on surrogate gradients due to the non-differentiability of the spike function, introducing approximation errors that accumulate across layers. To address this challenge, we extend the work on convexification of parallel feedforward threshold networks to parallel recurrent threshold networks, which subsume parallel SNNs as a structured special case. Building on this theoretical framework, we propose a parameter reconstruction algorithm for SNN training that demonstrates consistent and significant advantages across various tasks, both as a standalone method and in combination with surrogate-gradient training. The ablations further demonstrate the data scalability and robustness to model configurations of our training algorithm, pointing toward its potential in large-scale SNN training.
☆ Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/
☆ Learning CLI Agents with Structured Action Credit under Selective Observation
Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.
☆ Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims NeurIPS 2026
Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
comment: 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)
☆ Abductive Reasoning with Probabilistic Commonsense
Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals' distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.
☆ Graph-Structured Hyperdimensional Computing for Data-Efficient and Explainable Process-Structure-Property Prediction
Multiphoton photoreduction enables high-fidelity fabrication of complex 3D microstructures, yet reliable process-structure-property (PSP) prediction remains difficult because the available data are sparse, heterogeneous, and interaction-dominated. In this regime, conventional feature-vector models are statistically underdetermined, making them prone to spurious correlations, poor regime transfer, and unstable post hoc explanations, whereas mechanistic pipelines depend on calibrated submodels that are rarely available during early process development. We present PSP-HDC, a graph-structured hyperdimensional computing framework that encodes a directed PSP graph as an internal prior for representation, inference, and explanation. A trainable scalar-to-hypervector encoder learns parameter-specific embeddings on a fixed hyperdimensional basis to accommodate heterogeneous scales and noise. Sample representations are then composed through graph-aligned binding and bundling along directed PSP dependencies, and prediction is performed by associative-memory retrieval against class prototypes. Because the same prototype memories support both decision making and attribution, PSP-HDC provides intrinsic explanations at the parameter, group, and within-group levels, while memory alignment and separation quantify prototype formation during training. On sheet-resistance regime prediction for the 3D platform, PSP-HDC achieves an accuracy of 0.910 +/- 0.077 over 1000 random splits and 0.896 under process-fold generalization, outperforming strong baselines.
comment: 19 pages, 18 figures
☆ Tool Calling is Linearly Readable and Steerable in Language Models
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
comment: 29 pages, 6 figures, 7 tables. Manuscript under review
☆ Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
comment: 23 pages, 3 figures
☆ Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach.
☆ Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. We localize the Gemma-3-27B handoff to five attention heads through two-stage path patching that recover ~90% of the rhyme-routing capacity at the newline.
comment: 13 pages, 20 figures, 3 tables
☆ The Limits of AI-Driven Allocation: Optimal Screening under Aleatoric Uncertainty
The rise of machine learning has shifted targeted resource allocation in policy and humanitarian settings toward algorithmic targeting based on predicted risk scores. This approach is typically cheaper and faster than traditional screening procedures that directly observe the latent vulnerability status through physical verification. Yet, even access to the true conditional vulnerability probability cannot eliminate misallocation: aleatoric uncertainty over individual vulnerability status is irreducible, and probabilistic targeting inevitably misallocates some resources. In this work we study how screening and algorithmic targeting should be optimally combined in a two-stage allocation framework where a screening stage observes true outcomes for a subset of units before a final allocation stage assigns the resource under a fixed coverage budget. We show that the optimal strategy screens units at the margin of algorithmic allocation, while directly targeting the highest-risk units. Furthermore, we empirically characterize when screening and algorithmic targeting act as complements or substitutes: efficiency gains from screening grow as the aleatoric uncertainty in the population increases. We illustrate our framework with applications in income-based social protection programs and humanitarian demining in Colombia, where the tension between screening costs and allocation efficiency is operationally consequential.
☆ It Just Takes Two: Scaling Amortized Inference to Large Sets
Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size -- a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.
☆ TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model
Multiple sclerosis (MS) expresses substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning-based SOTA is highly susceptible to changes in both distribution, e.g., changes in scanner; as well as the structure of inputs, evident in the current divide between cross-sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast-agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross-sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model-based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in-house datasets show that TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both the former and LST-AI. All source code related to the development of TimeLesSeg is available at https://github.com/NeuroADaS-Lab/TimeLesSeg.
☆ Exploring the non-convexity in machine learning using quantum-inspired optimization
The escalating complexity of modern machine learning necessitates solving challenging non-convex optimization problems, particularly in high-dimensional regimes and scenarios contaminated by gross outliers. Traditional approaches, relying on convex relaxations or specialized local search heuristics, frequently succumb to suboptimal local minima and fail to recover the true underlying discrete structures. In this paper, we propose treating these non-convex challenges as a global search problem and introduce a unified framework based on Quantum-Inspired Evolutionary Optimization (QIEO). By leveraging a probabilistic representation inspired by quantum superposition, QIEO maintains a global view of the search space, enabling it to tunnel through local optima that trap conventional gradient-based and greedy solvers. We comprehensively evaluate QIEO across diverse non-convex applications, including sparse signal recovery (gene expression analysis and compressed sensing) and robust linear regression. Extensive benchmarking against state-of-the-art continuous solvers (ADAM, Differential Evolution), classical metaheuristics (Genetic Algorithms), and specialized non-convex algorithms (Iterative Hard Thresholding) demonstrates that QIEO consistently achieves superior structural fidelity, lower mean squared error, and enhanced robustness without support inflation. Our findings suggest that embracing a quantum-inspired global search provides a resilient, unified paradigm for overcoming the inherent intractability of discrete nonconvex machine learning landscapes.
☆ TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
☆ TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.
☆ One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld~MT50, reaches 95.6% on LIBERO-Long (vs.85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs.20.0% for $π_0$).
☆ INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy ICLR-26
Differential privacy (DP) is widely employed in machine learning to protect confidential or sensitive training data from being revealed. As data owners gain greater control over their data due to personal data ownership, they are more likely to set their own privacy requirements, necessitating individualized DP (IDP) to fulfil such requests. In particular, owners of data from more sensitive subsets, such as positive cases of stigmatized diseases, likely set stronger privacy requirements, as leakage of such data could incur more serious societal impact. However, existing IDP algorithms induce a critical utility imbalance problem: Data from owners with stronger privacy requirements may be severely underrepresented in the trained model, resulting in poorer performance on similar data from subsequent users during deployment. In this paper, we analyze this problem and propose the INO-SGD algorithm, which strategically down-weights data within each batch to improve performance on the more private data across all iterations. Notably, our algorithm is specially designed to satisfy IDP, while existing techniques addressing utility imbalance neither satisfy IDP nor can be easily adapted to do so. Lastly, we demonstrate the empirical feasibility of our approach.
comment: Accepted to the 14th International Conference on Learning Representations (ICLR-26)
☆ AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.
☆ Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
☆ Sycophantic AI makes human interaction feel more effortful and less satisfying over time
Millions of people now turn to artificial intelligence (AI) systems for personal advice, guidance, and support. Such systems can be sycophantic, frequently affirming users' views and beliefs. Across five preregistered studies (N = 3,075 participants, 12,766 human-AI conversations), including a three-week study with a census-representative U.S. sample, we provide longitudinal experimental evidence that sycophantic AI shifts how users approach their closest relationships. We show that sycophantic AI immediately delivers the emotional and esteem support users typically associate with close friends and family. Over three weeks of such interactions, users became nearly as likely to seek personal advice from sycophantic AI as from close friends and family, and reported lower satisfaction with their real-world social interactions. When given a choice among AI response styles, a majority preferred sycophantic AI -- not for the quality of its advice, but because it made them feel most understood. Together, these findings offer a relational account of AI sycophancy: by providing frictionless understanding, it may quietly raise the bar against which human relationships are judged.
☆ Statistical inference with belief functions: A survey
Belief functions are a powerful and popular framework for the mathematical characterisation of uncertainty, in particular in situations in which lack of data renders learning a probability distribution for the problem impractical. The first step in a reasoning chain based on belief functions is inference: how to learn a belief measure from the available data. In this survey we focus, in particular, on making inference from statistical data, and review the most significant contributions in the area.
comment: 9 pages, 0 figures
☆ CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers ICML 2026
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
comment: Accepted by ICML 2026
☆ BeeVe: Unsupervised Acoustic State Discovery in Honey Bee Buzzing
Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeVe, an unsupervised framework for acoustic state discovery in collective honey bee buzzing. BeeVe uses the self-supervised Patchout Spectrogram Transformer (PaSST) as a frozen feature extractor, then trains a Vector-Quantized Variational Autoencoder (VQ-VAE) without labels on those embeddings, learning a finite discrete codebook of acoustic tokens directly from unlabelled hive audio. No labels, pretext tasks, or contrastive objectives are used at any stage. Post-hoc evaluation against known queen status reveals that the learned tokens separate queenright and queenless conditions with Jensen-Shannon Divergence values between 0.609 and 0.688, and that the queenless condition further decomposes into three internally coherent sub-states stable across experiments with different codebook sizes and random seeds. Token transition analysis confirms non-random sequential structure (p << 0.001) across all experiments. Generalisation to unseen recordings preserves both token overlap (Jaccard = 0.947) and global manifold topology. These results demonstrate that unsupervised discrete codebook learning can recover repeatable acoustic structure from a non-vocal biological signal without annotation, opening a path toward non-invasive acoustic hive health monitoring.
☆ Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage~1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage~2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48\% at 128 frames over the backbone.
☆ What if AI systems weren't chatbots?
The rapid convergence of artificial intelligence (AI) toward conversational chatbot interfaces marks a critical moment for the industry. This paper argues that the chatbot paradigm is not a neutral interface choice, but a dominant sociotechnical configuration whose widespread adoption reshapes social, economic, legal, and environmental systems. We examine how treating AI primarily as conversational assistants has extensive structural downsides. We show how chatbot-based systems often fail to adequately meet user needs, particularly in complex or high-stakes contexts, while projecting confidence and authority. We further analyze how the normalization of chatbot-mediated interaction alters patterns of work, learning, and decision-making, contributing to deskilling, homogenization of knowledge, and shifting expectations of expertise. Finally, we examine broader societal effects, including labor displacement, concentration of economic power, and increased environmental costs driven by sustained investment in large-scale chatbot infrastructures. While acknowledging legitimate benefits, we argue that the current trajectory of AI development reflects specific value choices that prioritize conversational generality over domain specificity, accountability, and long-term social sustainability. We conclude by outlining alternative directions for AI development and governance that move beyond one-size-fits-all chatbots, emphasizing pluralistic system design, task-specific tools, and institutional safeguards to mitigate social and economic harm.
comment: Accepted at The 2026 ACM Conference on Fairness, Accountability, and Transparency, June 25--28, 2026, Montreal, QC, Canada
☆ Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Multimodal reward models have advanced substantially in text and image domains, yet progress in video understanding reward modeling remains severely limited by the lack of robust evaluation benchmarks and high-quality preference data. To address this, we propose a unified framework spanning benchmark design, data construction, and reward model training. We introduce Video Understanding Reward Bench (VURB), a benchmark featuring 2,100 preference pairs with long chain-of-thought reasoning traces (averaging 1,143 tokens) and majority voting evaluation across general, long, and reasoning-oriented video tasks. We further construct Video Understanding Preference Dataset (VUP-35K) via a fully automated pipeline, providing large-scale high-quality supervision for video reward training. Building on the data, we train VideoDRM and VideoGRM, a discriminative and a generative reward model, both achieving state-of-the-art performance on VURB and VideoRewardBench. Further analysis confirms that VUP-35K enhances both reward performance and model reasoning capability, while VideoDRM and VideoGRM yield significant gains under best-of-$N$ test-time scaling.
☆ Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that edge of the spectrum still converges for sufficiently wide networks.
☆ KL for a KL: On-Policy Distillation with Control Variate Baseline
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline-canonically a value function -- from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
☆ On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems
Federated Learning (FL) has emerged as a promising paradigm for preserving client data ownership and control over distributed Internet of Things (IoT) environments. While discriminative models dominate most FL use cases, recent advances in generative models -- such as Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models (DM) -- offer new opportunities for unsupervised anomaly detection in time series analysis, with relevant applications in predictive maintenance (PdM) in critical industrial infrastructures. In this work, we present a comprehensive analysis of VAEs, GANs, and DMs in the context of federated PdM. We analyze their performance and communication overhead under both full and partial federation setups, where only subsets of model components are shared. Building on this analysis, the paper proposes a novel taxonomy for federated generative models that formalizes partial component sharing as a principled mechanism for model personalization. Our experiments over a real-world time series dataset reveal distinct trade-offs in model utility, stability, and scalability, especially in heterogeneous and bandwidth-constrained FL settings. For the evaluated GAN-based configurations, full federation improves training stability relative to independent local training, although the model remains less robust than the VAE- and DDPM-based alternatives. For DMs, however, partial federation -- especially decoder sharing -- can outperform full federation in bandwidth-constrained, non-IID settings.
☆ MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
☆ \mathsf{VISTA}: Decentralized Machine Learning in Adversary Dominated Environments
Decentralized machine learning often relies on outsourcing computations, such as gradient evaluations, to untrusted worker nodes. Existing robust aggregation methods can mitigate malicious behavior under honest-majority assumptions, but may fail when adversaries control a majority of the workers. We study this adversary-dominated setting through an incentive-oriented framework in which reports are accepted and rewarded only when they are mutually consistent up to a threshold. This turns the adversary from a pure saboteur into a rational agent that trades off increasing estimation error against the risk of rejection and loss of reward. We consider iterative optimization under this model. Unlike one-shot computation, iterative learning requires long-horizon decisions: permissive acceptance rules enable faster early progress but admit more adversarial corruption, while strict rules improve estimation accuracy but cause frequent rejections. We propose \mathsf{VISTA}, an adaptive algorithm that tunes the acceptance threshold using the optimization history. Numerical results show that \mathsf{VISTA} improves convergence over static thresholds. We also provide a rigorous convergence analysis showing that, with suitable incentive-aware adaptation, adversary-dominated decentralized learning can retain the asymptotic convergence behavior of standard SGD without relying on an honest majority.
☆ Exact Regular-Constrained Variable-Order Markov Generation via Sparse Context-State Belief Propagation
Variable-order Markov models generate sequences over a finite alphabet by conditioning each symbol on the longest available suffix of the generated history. Regular constraints, by contrast, describe finite-horizon control requirements by an automaton: fixed positions, forced endings, metrical patterns, and forbidden copied fragments are all special cases. Existing exact methods already handle regular constraints with belief propagation for first-order Markov chains. The contribution here is the variable-order extension: identifying the state space on which the existing BP-regular machinery must be run when the generator is a variable-order/backoff model. A first-order constraint layer can enforce useful support conditions, but it computes future mass after merging histories that a variable-order generator deliberately keeps distinct. We formalize this mismatch and give the sparse construction obtained by replacing the first-order Markov state with the observed context state, then taking the standard product with the regular constraint automaton. For a fixed trained context graph and automaton, inference is linear in the sequence horizon; in general it is polynomial in the number of reachable product edges. This gives the correct variable-order distribution conditioned on regular constraints without expanding to all K-tuples. The same finite-source interface supports reversible data augmentation by inverse count lookup, matching materialized transposition augmentation without storing transformed corpora. We also separate exact BP inference from generation-time backoff policies, such as singleton avoidance, whose stochastic semantics must be made explicit if exactness is claimed.
☆ PPI-Net connects molecular protein interactions to functional processes in disease
Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high-throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction (PPI) networks with pathway-level representations to model disease from molecular interactions to functional processes. Patient-specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi-layer Reactome hierarchy using graph attention, enabling aggregation of gene-level signals into higher-order biological programs. Across RNA-seq data from ten cancer types from The Cancer Genome Atlas, PPI-Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA-Seq data from breast cancer demonstrated that PPI-Net's integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI-only model, while hierarchical multi-level supervision improved balanced accuracy by 12.3% relative to using only a single top-level prediction head. Applying a multi-omics approach using RNA-seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53-AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.
comment: 17 pages, 3 figures, 2 tables
☆ Approximation-Free Differentiable Oblique Decision Trees
Decision Trees (DTs) are widely used in safety-critical domains such as medical diagnosis, valued for their interpretability and effectiveness on tabular data. However, training accurate oblique DTs is challenging due to complex optimization landscapes and overfitting risks, particularly in regression. Recent advances have introduced differentiable formulations that enable gradient-based training and joint optimization of decision boundaries and leaf regressors. Yet, existing approaches typically rely on approximations, either through probabilistic softening of boundaries (soft DTs) or quantized gradients such as the Straight-Through Estimator (STE). To overcome these limitations, we propose DTSemNet, a novel, semantically equivalent, and invertible representation of hard oblique DTs as neural networks. DTSemNet enables end-to-end training with standard gradient descent, eliminating the need for approximations in both classification and regression. While classification aligns naturally with this formulation, regression remains challenging due to the joint optimization of internal nodes and leaf regressors. To address this, we analyze the limitations of STE and introduce an annealed Top-k method that provides accurate gradient signals without approximation. Extensive experiments on classification and regression benchmarks show that DTSemNet-trained oblique DTs outperform state-of-the-art differentiable DTs. Furthermore, we demonstrate that DTSemNet can serve as programmatic DT policies in reinforcement learning environments, thereby broadening their applicability.
comment: Accepted for publication in JMLR, Vol. 27, 2026
☆ CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.
comment: Under Review
☆ Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection CVPR2026
Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.
comment: This paper has been accepted by CVPR2026
☆ GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
☆ Text-to-CAD Evaluation with CADTests
Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
☆ Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
☆ Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these ``drifted'' trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce \textbf{Prune-OPD}, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-$k$ overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6\%--68.0\% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
comment: 17 pages, 8 figures
☆ Toward Privileged Foundation Models:LUPI for Accelerated and Improved Learning
Training foundation models is computationally intensive and often slow to converge.We introduce PIQL,Privileged Information for Quick and Quality Learning, the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect, reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
☆ Neural Operators as Efficient Function Interpolators
Neural operators (NOs) are designed to learn maps between infinite-dimensional function spaces. We propose a novel reframing of their use. By introducing an auxiliary base-space, any finite-dimensional function can be viewed as an operator acting by composition on functions of the base-space. Through a range of benchmarks on analytic functions of increasing complexity and dimensionality, we demonstrate that NOs can match or outperform standard multilayer perceptrons and Kolmogorov--Arnold Networks in accuracy while requiring significantly fewer parameters and training time. As a real-world application, we apply a two-dimensional Tensorized Fourier Neural Operator (TFNO) to the nuclear chart, learning a correction to state-of-the-art nuclear mass models as a partially observed residual field. A TFNO ensemble reaches a held-out root-mean-square error of 198.2 keV, placing it among the best recent neural-network approaches while retaining high parameter efficiency and short training times. More broadly, these results introduce NOs as a scalable framework for finite-dimensional function interpolation, from analytic benchmarks to structured scientific data.
comment: 12 pages, 9 figures
☆ APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidences. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
☆ Tracing Uncertainty in Language Model "Reasoning"
Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the "reasoning" traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM "reasoning".
☆ POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T γ_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
comment: preprint
☆ RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
Platform content moderation applies explicit policy rules and context-dependent conditions to decide whether user content is allowed, restricted, or removed. A correct moderation outcome must therefore depend on which rules a case activates, how those rules interact, and whether the available evidence is sufficient. Current multimodal safety benchmarks largely reduce moderation to matching predefined final labels, leaving this underlying rule structure untested. As a result, a high benchmark score reveals little about whether a model applies the policy correctly or arrives at the correct label through superficial cues. To evaluate this rule-governed process, we introduce RuleSafe-VL, a benchmark for rule-conditioned decision reasoning in vision-language content moderation. Derived from publicly available platform moderation policies, RuleSafe-VL formalizes 93 atomic rules and 92 typed rule relations, yielding 2,166 context-sensitive image-text cases across three high-risk policy families. Its four diagnostic tasks decompose moderation into a rule-conditioned decision chain. They identify activated rules, recover rule interactions, judge decision sufficiency, and resolve outcomes once missing context is supplied. Experiments on 10 frontier, open-source, and safety-oriented VLMs reveal rule-relation recovery as the dominant bottleneck, where the best model reaches only 64.8 Macro-F1 and some safety-oriented models fall below 7 Macro-F1. Decision-state prediction also remains unreliable, peaking at 64.5 Macro-F1. RuleSafe-VL shifts moderation evaluation from final-label scoring toward diagnostic assessment of rule-conditioned decision reasoning.
comment: Preprint
☆ When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining
Modern deep models are often pretrained on large-scale data with missing labels using composite objectives, where the relative weights of multiple loss terms act as hyperparameters. Tuning these weights with random search or Bayesian optimization is computationally expensive, as it requires many independent training runs. To address this, we propose a gradient-based bilevel method that learns pretraining loss weights online by aligning the composite pretraining gradient with a downstream objective. By exploiting the structure of the loss, the method avoids the multiple backward passes typically required by truncated backpropagation through the full model, reducing the overhead of hyperparameter tuning to approximately 30% above a single training run. We evaluate the approach on event-sequence modeling and self-supervised computer vision, where it matches or improves upon carefully tuned baselines while substantially reducing the cost of hyperparameter tuning compared to random or Bayesian search.
☆ Vibe coding before the trend
Early 2025 we ran a series of vibe coding challenges across four different student cohorts. The cohorts included 54 ICT students, 24 digital marketing students, and 7 journalism students at Fontys University of Applied Sciences (Netherlands), and 22 BA Communication students at North-West University (South Africa). From the student reflections, five major patterns emerged. Students reported that AI tools shifted their focus from syntax to higher-order thinking; they also described a skill shift from memorizing to evaluating; they viewed AI proficiency as career-essential; they framed their relationship with AI as partnership rather than replacement; and finally non-technical students showed the strongest appreciation for the accessibility these tools provide. This practitioner report documents what we observed during the classroom experiments, we reflect on how the landscape has shifted in the year since, and shares practical lessons for educators considering similar experiments. We present the observations as what they are: patterns from practice, not proven conclusions, in the beleif that sharing early stage experiences contributes to the overall field of AI and education.
comment: 10 pages
☆ Alternating Target-Path Planning for Scalable Multi-Agent Coordination
The concurrent target assignment and pathfinding (TAPF) problem extends multi-agent pathfinding (MAPF) by asking planners to allocate distinct targets and collision-free paths to agents. Prior work on TAPF has relied exclusively on Conflict-Based Search (CBS), which tightly couples target assignment and pathfinding, resulting in compute-intensive, non-scalable solutions. In contrast, we propose an iterative refinement framework that decouples target assignment from pathfinding. Our framework builds on modern, fast, suboptimal MAPF solvers, such as LaCAM. Specifically, within a given time budget, it repeatedly solves MAPF for the current target assignment, identifies bottleneck agents via MAPF feedback, and refines the assignment. Empirical results show that feedback-driven reassignment loop is effective, enabling our framework to scale well beyond the reach of the state-of-the-art CBS-based solver while maintaining decent solution quality. This represents a solid step toward practical, large scale TAPF suitable for real-world setups.
☆ Online Goal Recognition using Path Signature and Dynamic Time Warping
Online goal recognition in continuous domains poses two central challenges: efficiently encoding large trajectories and effectively comparing them. Recent work addresses these challenges by using custom state-space representations and metrics to compare observations against hypotheses. However, these approaches often overlook well-established encoding techniques used in other domains that offer substantial advantages. This paper introduces a novel method for online goal recognition that leverages path signatures, a compact, expressive representation of rough path theory that efficiently captures key semantic features of trajectories, enabling more meaningful comparisons between them. Experiments show that our method consistently outperforms the state of the art in predictive accuracy and online planning efficiency, while remaining competitive offline.
comment: Accepted as part of the 35th International Joint Conference on Artificial Intelligence
☆ Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach SC
Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers prevent traditional matching approaches, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combined with temporal information, then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves 26 percentage point precision improvement in North America and 14 points in Europe, while doubling coverage. Deployed in production at Project44 handling full truckload shipments, the system demonstrates robustness to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
comment: 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)
☆ Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.
☆ Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient is decomposed into an ascent toward higher action-value regions and a score matching with the anchor policy as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.
☆ SOD: Step-wise On-policy Distillation for Small Language Model Agents
Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
☆ Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that over-optimize that objective. Prior work suggests that such collapse is unavoidable without adding real data into the mix. We revisit this conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.
comment: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
☆ LLM hallucinations in the wild: Large-scale evidence from non-existent citations
Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.
☆ Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
comment: 22 pages, 5 figures, 11 tables
☆ An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well -- the worst average degradation is only -0.26 relative to FULL, while delivering 1.5$\times$-3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
☆ The AI-Native Large-Scale Agile Software Development Manifesto
Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles, parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints, that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.
☆ Hierarchical Task Network Planning with LLM-Generated Heuristics NeurIPS 2026
HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.
comment: 9 pages, 3 figures; submitted to NeurIPS 2026
☆ Cross-Attention and Encoder-Decoder Transformers: A Logical Characterization
We give a novel logical characterization of encoder-decoder transformers, the foundational architecture for LLMs that also sees use in various settings that benefit from cross-attention. We study such transformers over text in the practical setting of floating-point numbers and soft-attention, characterizing them with a new temporal logic. This logic extends propositional logic with a counting global modality over the encoder input and a past modality over the decoder input. We also give an additional characterization of such transformers via a type of distributed automata, and show that our results are not limited to the specific choices in the architecture and can account for changes in, e.g., masking. Finally, we discuss encoder-decoder transformers in the autoregressive setting.
☆ Finite-Time Analysis of MCTS in Continuous POMDP Planning
This paper presents a finite-time analysis for Monte Carlo Tree Search (MCTS) in Partially Observable Markov Decision Processes (POMDPs), with probabilistic concentration bounds in both discrete and continuous observation spaces. While MCTS-style solvers such as POMCP achieve empirical success in many applications, rigorous finite-time guarantees remain an open problem due to the nonstationarity and the interdependencies induced by heuristic action selection (e.g., UCB). In the discrete setting, we address these challenges by extending the polynomial exploration bonus to UCB in POMDP setting, yielding polynomial concentration bounds for the empirical value estimation at the root node. For continuous observation spaces, we introduce an abstract partitioning framework and propose a finite-time bound on partitioning loss. Under mild conditions, we prove highprobability bound on value estimates in POMDPs with continuous observation space. Specifically, we propose Voro-POMCPOW, a variant of POMCPOW with f inite-time guarantees that adaptively partitions the continuous observation space using Voronoi cells. This approach maintains a finite branching factor while preserving the original observation generator. Empirical validation demonstrates that the proposed Voro-POMCPOW shows competitive performance while providing theoretical guarantees. Although our analysis focuses on continuous POMDPs, the techniques developed herein are also applicable to continuous MDPs, closing another gap on the MDP side.
comment: 9 pages, 1 figure
☆ DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with a realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
comment: 10 pages
☆ Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation
Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on the recording conditions. In this work, we decompose simulated RIRs into four variants (full, direct-only, no-late, and no-early) using the mixing time estimated from the echo density function as the boundary between early reflections and late reverberation. We define four calibration scenarios, from fully calibrated (synchronised capture, known source level) to fully uncalibrated (arbitrary onset, unknown level), and evaluate all combinations on a matched dataset. Results show that without time calibration, mean absolute error (MAE) increases to $1.29$ m and the model extracts reverberation-based cues, with early reflections emerging as the most informative component. Further analysis against DRR, $C_{50}$, and $T_{60}$ confirms that estimation accuracy improves with stronger early energy and degrades in highly reverberant environments. When time calibration is available, the model achieves a MAE of $0.14$ m by extracting the propagation delay alone, regardless of the RIR content.
comment: Submitted to IWAENC 2026
☆ GASim: A Graph-Accelerated Hybrid Framework for Social Simulation
Large-scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)-based agents with numerical agent-based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph-accelerated hybrid multi-agent framework for large-scale social simulations. For core agents driven by LLM, GASim introduces Graph-Optimized Memory (GOM) to replace intensive LLM-based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine-grained feature aggregation and Graph Attention Network. We further introduce Entropy-Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information-diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94-fold end-to-end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real-world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.
☆ TRACE: Tourism Recommendation with Accountable Citation Evidence
Tourism is a high-stakes setting for conversational recommender systems (CRS): a plausible-sounding suggestion can waste real money and trip time once a traveler acts on it. Existing CRS benchmarks primarily evaluate systems with a single Recall@k score over entity mentions, and tourism-specific resources add spatial or knowledge-graph context, yet none of them couple multi-turn recommendation with verbatim review-span evidence and rejection recovery. This leaves an evaluation gap for tourism recommendation that is simultaneously trustworthy, verifiable, and adaptive: recommend the right point of interest (POI) for multi-aspect preferences (such as cuisine, price, atmosphere, walking distance), justify each suggestion with verifiable evidence from prior visitors so the traveler can act without trial and error, and recover when the first recommendation is rejected mid-dialogue. We introduce TRACE, where each item is a multi-turn tourism recommendation dialogue with review-span citations and explicit rejection turns: 10,000 dialogues over 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities, paired with 14 retrieval, planning, and LLM baselines, along with 25 metrics organized under Accuracy, Grounding, and Recovery. Across these baselines, TRACE reveals the Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; Multi-Review Synthesis fails at recovery. The Grounding Score agrees with human citation precision (Spearman rho=+0.80, p<10^-20), and paired t-tests reproduce the per-baseline ranking (p<0.01 on the dominant contrasts). TRACE reframes accountable tourism recommendation as a joint target (right POI, verifiable evidence, adaptive repair) rather than a single-axis leaderboard.
☆ FactoryBench: Evaluating Industrial Machine Understanding
We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
comment: 9 pages, 4 figures, 14 tables; appendix with 24 pages
☆ The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $Ω(\text{Var}(1/G'') (γ/β)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
comment: 38 pages, no figures. Targeting ACM Transactions on Economics and Computation (TEAC); preprint
☆ Towards Billion-scale Multi-modal Biometric Search
Searching a multi-biometric database of a billion records for a country-level identity system requires pushing the limits of all aspects of a biometric system, including acquisition, preprocessing, feature extraction, accuracy, matching speed, presentation attack detection, and handling of special cases (e.g., missing finger digits). This is the first paper that gives insights into such a large-scale multimodal biometric search system, called Bharat ABIS, based on open-source architectures. The end-to-end pipeline of Bharat ABIS processes fingerprint, face and iris modalities through modality-specific stages of preprocessing (segmentation), quality assessment, presentation attack detection, and learning an embedding (feature extraction), producing a concatenated template of 13.5KB per person. We present a detailed analysis of the modalities and how they are integrated to create an efficient and effective solution for 1:N search (de-duplication). Evaluations on a demographically stratified gallery of 220 million identities, randomly sampled from 1.55 billion records in India's Aadhaar database, yield an FNIR of 0.3% at an FPIR of 0.5%, for adult probes (over 18 years). We also compare the performance of Bharat ABIS against three state-of-the-art COTS systems on a 20M gallery. Our system achieves a throughput of 100 searches per second on a gallery of 40M on a single server (8xNvidia H100 GPUs, 2TB RAM).
☆ Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
comment: 8 pages, 4 figures
☆ Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
☆ MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing
While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
comment: 24 pages, 2 figures
☆ LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multi sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models eveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
☆ Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference
Tacit knowledge plays a central role in human expertise, yet it remains difficult to capture, formalize, and reuse in machine-interpretable form. This challenge is especially relevant in procedural domains, where successful execution depends not only on explicit instructions, but also on implicit assumptions, contextual constraints, embodied skills, and experience-based judgments rarely documented. As a result, current knowledge engineering pipelines struggle to transform tacit and process-centric knowledge into formally specified, machine-interpretable representations that can be queried, validated, reasoned over, and reused. In this paper, we introduce a neuro-symbolic framework that combines Logic-Augmented Generation and an Active-Inference-inspired approach for ontology-grounded Knowledge Graph construction. We evaluate the approach in a knowledge transfer case study in manufacturing, using assembly-like repair procedures from instructional videos as a reproducible proxy domain. Results show that the proposed solution improves completeness and semantic quality, advancing neuro-symbolic knowledge engineering for industrial domains.
☆ Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.
☆ Post-training makes large language models less human-like
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
☆ Inference Time Causal Probing in LLMs
Causal probing methods aim to test and control how internal representations influence the behavior of generative models. In causal probing, an intervention modifies hidden states so that a property takes on a different value. Most existing approaches define such interventions by training an auxiliary probe classifier, which ties the method to a specific task or model and risks misalignment with the model's predictive geometry. We propose Hidden-state Driven Margin Intervention (HDMI), a probe-free, gradient-based technique that directly steers hidden states using the model's native output. HDMI applies a margin objective that increases the probability of a target continuation while decreasing that of the source, without relying on probe classifiers. We further introduce a lookahead variant (LA-HDMI) for text editing that backpropagates through the softmax embeddings, modifying the current hidden state so that the likelihood of user-specified tokens increases in next token generations while preserving fluency. To evaluate interventions, we measure completeness (whether the targeted property changes as intended) and selectivity (whether unrelated properties are preserved), and report their harmonic mean as an overall measure of reliability. HDMI consistently achieves higher reliability than prior methods on the LGD agreement corpus and the CausalGym benchmark, across Meta-Llama-3-8B-Instruct, and Pythia-70M.
comment: 16 pages, 4 tables, 3 figures
☆ Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
comment: work in progress
☆ Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification ACL 2026
Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026 the eight positive defence categories share surface language and differ only in pragmatic function and trained raters reach only moderate inter-annotator agreement. On such a task the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative) and base model. The system reaches $F1_{test}{=}.420$ on the hidden test set, placing first among 21 registered teams.
comment: Accepted at the BioNLP 2026 PsyDefDetect Shared Task @ ACL 2026 (1st place, 21 registered teams)
☆ SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
3D animal reconstruction in the wild remains challenging due to large species variation, frequent occlusions, and the prevalence of multi-animal scenes, while existing methods predominantly focus on single-animal settings. We present SAM 3D Animal, the first promptable framework for multi-animal 3D reconstruction from a single image. Built on the SMAL+ parametric animal model, our method jointly reconstructs multiple instances and supports flexible prompts in the form of keypoints and masks which enable more reliable disambiguation in crowded and occluded scenes. To train such a model, we further introduce Herd3D, a multi-animal 3D dataset containing over 5K images, designed to increase diversity in species, interactions, and occlusion patterns. Experiments on the Animal3D, APTv2, and Animal Kingdom datasets show that our framework achieves state-of-the-art results over both existing model-based and model-free methods, demonstrating a scalable and effective solution for prompt-driven animal 3D reconstruction in the wild.
☆ Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to ``mastered'' and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept -- distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired $t$-test, $p < 10^{-6}$, Cohen's $d = 0.86$), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1$\times$ higher ATE than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7\% on the contamination-free Omni-MATH-Rule benchmark and 64.0\% overall, compared to 60.5\% for o1-mini, and 97.2\% on GSM8K, 46--50\% on AIME 2024--2026, and 46.2\% on MathArena. The Causal Knowledge Activation component contributes 33.8\% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
comment: 17 pages, 0 figures
☆ Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.
☆ Parallel Lifted Planning via Semi-Naive Datalog Evaluation
Lifted classical planners operate directly on first-order planning tasks to avoid the computationally demanding grounding step. However, lifted planning is typically slower, as planners must repeatedly instantiate ground structures during search. Many core components of lifted classical planning, such as successor generation, axiom evaluation, task grounding, and delete-relaxed heuristics, have previously been studied through the lens of Datalog evaluation. We build upon this line of work and extend it by developing and analyzing an execution model with two levels of parallelism: rule-level parallelism and grounding parallelism. We further specialize this solver for planning-specific workloads with a grounder based on clique enumeration, which we extend to support semi-naive Datalog evaluation. Our experimental evaluation using greedy best-first search with the FF heuristic shows that our implementation already solves more tasks than the baselines on a single core, and the gap widens as additional cores are used. Moreover, on hard-to-ground tasks where on average 97.6% of the total runtime is spent in Datalog execution, the proposed execution model exhibits an average parallel fraction of 92.4%, while achieving up to a 6-fold speedup on 8 cores in practice.
☆ Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
☆ Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding ACL 2026
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions.By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
comment: Accepted to ACL 2026
☆ Open-Ended Task Discovery via Bayesian Optimization
When applying Bayesian optimization (BO) to scientific workflow, a major yet often overlooked source of uncertainty is the task itself -- namely, what to optimize and how to evaluate it -- which can evolve as evidence accumulates. We introduce Generate-Select-Refine (GSR), a open-ended BO framework that alternates between task generation and task optimization. Starting from a user-provided seed task, GSR generates new tasks in a coarse-to-fine manner while a task-acquisition function schedules optimization. Asymptotically, it concentrates evaluations on the best task, incurring only logarithmic regret overhead relative to single-task BO. We apply GSR to new product development, chemical synthesis scaling, algorithm analysis, and patent repurposing, where it outperforms existing LLM-based optimizers.
comment: 60 pages, 11 figures
☆ Ensemble Distributionally Robust Bayesian Optimisation
We study zeroth-order optimisation under context distributional uncertainty, a setting commonly tackled using Bayesian optimisation (BO). A prevailing strategy to make BO more robust to the complex and noisy nature of data is to employ an ensemble as the surrogate model, thereby mitigating the weaknesses of any single model. In this study, we propose a novel algorithm for Ensemble Distributionally Robust Bayesian Optimisation that remains computationally tractable while managing continuous context. We obtain theoretical sublinear regret bounds, improving current state-of-the-art results. We show that our method's empirical behaviour aligns with its theoretical guarantees.
☆ ProteinJEPA: Latent prediction complements protein language models
Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35--150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M, 11/2/3 on ESM2-150M while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, \b{eta}β\b{eta}-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.
☆ Implicit Preference Alignment for Human Image Animation ICML 2026
Human image animation has witnessed significant advancements, yet generating high-fidelity hand motions remains a persistent challenge due to their high degrees of freedom and motion complexity. While reinforcement learning from human feedback, particularly direct preference optimization, offers a potential solution, it necessitates the construction of strict preference pairs. However, curating such pairs for dynamic hand regions is prohibitively expensive and often impractical due to frame-wise inconsistencies. In this paper, we propose Implicit Preference Alignment (IPA), a data-efficient post-training framework that eliminates the need for paired preference data. Theoretically grounded in implicit reward maximization, IPA aligns the model by maximizing the likelihood of self-generated high-quality samples while penalizing deviations from the pretrained prior. Furthermore, we introduce a Hand-Aware Local Optimization mechanism to explicitly steer the alignment process toward hand regions. Experiments demonstrate that our method achieves effective preference optimization to enhance hand generation quality, while significantly lowering the barrier for constructing preference data. Codes are released at https://github.com/mdswyz/IPA
comment: Accepted by ICML 2026
☆ From Pixels to Prompts: Vision-Language Models
When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: \emph{it is too easy to get lost}. The field moves quickly, new model names appear constantly, and the gap between ``I know the buzzwords'' and ``I actually understand how this works'' can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.
☆ Multi-Environment POMDPs with Finite-Horizon Objectives
Partially Observable Markov Decision Processes (POMDPs) are systems in which one agent interacts with a stochastic environment, and receives only partial information about the current state. In a multi-environment POMDP (MEPOMDP), the initial state is unknown, and assumed to be adversarially chosen. In this work we focus on computing the optimal value and policy in MEPOMDPs with finite-horizon objectives. That problem is known to be PSPACE-complete in POMDPs. Our main results are as follows: (1) we establish that it is also PSPACE-complete in the more general setting of MEPOMDPs; (2) we present a practical algorithm and evaluate it on classical benchmarks, significantly outperforming the only previously known algorithm.
☆ Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It
Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4--6\% computational overhead in practice.
☆ From Feasible to Practical: Pareto-Optimal Synthesis Planning
Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.
☆ Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration
Differentiable planning enables gradient-based optimization of decision-making problems by leveraging differentiable models of system dynamics. However, in highly nonlinear and hybrid discrete-continuous domains, the resulting optimization landscapes are often ill-conditioned, with flat regions and sharp transitions that hinder effective optimization. We propose Model-Driven Policy Optimization (MDPO), a framework that introduces stochastic exploration into differentiable planning by injecting noise into the action space during optimization. Leveraging access to the model, MDPO further adapts the noise magnitude based on gradient-derived sensitivity of the trajectory objective, yielding a time-dependent exploration profile. This enables improved exploration of the objective landscape and helps escape poor local optima via dynamic allocation of exploration across timesteps and iterations. Experiments on benchmark domains demonstrate that MDPO consistently outperforms deterministic differentiable planning, including both the noise-free variant of our method and available state-of-the-art implementations, as well as model-free baselines such as PPO, significantly improving solution quality across challenging nonlinear and hybrid settings. We further analyze the evolution of the adaptive noise magnitude across both time steps and optimization iterations, providing insight into how exploration is allocated during learning.
☆ LARAG: Link-Aware Retrieval Strategy for RAG Systems in Hyperlinked Technical Documentation
Retrieval-Augmented Generation (RAG) enhances the factual grounding of Large Language Models by conditioning their outputs on external documents. However, standard embedding-based retrievers treat naturally structured corpora, such as technical manuals, as flat collections of passages, thereby overlooking the hyperlink topology that users rely on when navigating such content. We introduce LARAG (Link-Aware RAG): a lightweight, link-aware retrieval strategy that leverages the author-defined hyperlink structure already present in HTML documentation, encoding hyperlink relations as metadata in the chunk representations and exploiting them to perform a form of graph-like retrieval of locally relevant content. In a benchmark of twenty expert-designed queries over Rulex Platform technical documentation and four prompting strategies, LARAG consistently improves answer quality, achieving the highest BERTScore F1, while retrieving fewer chunks and generating fewer tokens than a baseline RAG architecture used for comparison. These results show that directly leveraging the existing hyperlink topology of technical documentation, even without explicit graph construction or inference, enables an implicit form of graph-like retrieval that yields a more faithful and efficient RAG pipeline, providing better grounding at lower cost.
☆ LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.
☆ Efficient Data Selection for Multimodal Models via Incremental Optimization Utility
The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven their effectiveness in addressing this but suffer from prohibitive computational costs and lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as an incremental optimization utility ranking problem. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6 point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines like DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, successfully reversing the negative transfer frequently observed in complex reasoning tasks.
☆ Excluding the Target Domain Improves Extrapolation: Deconfounded Hierarchical Physics Constraints
Extrapolation to out-of-distribution conditions is a fundamental challenge for physics-constrained deep generative models. Existing methods apply physical constraints as a single static regularization term uniformly across the generation process, and address neither the hierarchical structure of physical laws and the confounding variable problem. We propose the Deconfounded Hierarchical Gate (DHG), which serves as a diagnostic and control mechanism: it identifies when and how strongly temperature confounding contaminates each constraint level, so that hierarchical gates reflect intrinsic physical inconsistency rather than spurious temperature effects. DHG combines counterfactual estimation via the do-operator with backdoor adjustment to remove confounding, then applies Coarse-to-Fine physical constraints progressively. We report a counter-intuitive finding in pretraining: excluding the target-domain data from pretraining outperforms including it by 39% in extrapolation performance (RMSE 0.224 vs. 0.324). This occurs because FNO learns domain-agnostic physical patterns that transfer more effectively when the target domain is withheld. On a lithium-ion battery temperature extrapolation benchmark (trained at 24 degrees Celsius, evaluated at 4.0--43.0 degrees Celsius), our method achieves RMSE = 0.215, a 46% improvement over the unconstrained baseline (Pure CFM: 0.397).
comment: 16 pages, 2 figures
☆ Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization
Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(\varphi, ψ, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
☆ SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion
Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token's logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
☆ Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs
In this paper, we investigate the recent state-of-the-art schemes for watermarking large language models (LLMs) outputs. These techniques are claimed to be robust, scalable and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified text attacks, which perform targeted semantic changes without altering the general meaning of the text content. Our approach encompasses multiple attack strategies, which include lexical alterations, machine translation, and even neural paraphrasing. The attack efficacy is measured with two target criteria - successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERT scores, text complexity measures, grammatical errors, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness among different watermarking models, with the same underlying result that it is possible to remove the watermark with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems, suggesting how they should be constructed to improve security of available schemes.
☆ ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Vision-Language-Action (VLA) models hold great promise for general-purpose robotic intelligence, yet scaling up such models is severely bottlenecked by the high cost of acquiring annotated training data. Fortunately, vision-equipped robots deployed across various domains already produce abundant vision-action pairs that can be leveraged to scale up VLA training more efficiently. However, these raw data cannot be centrally aggregated due to various constraints and also exhibit severe heterogeneity. To address these challenges, in this paper, we propose ForgeVLA, a federated VLA training framework that learns VLA models from distributed vision-action pairs without centralizing raw data or requiring manual annotations. Specifically, each client in ForgeVLA is equipped with an embodied instruction classifier that maps vision-action pairs to a predefined instruction set, recovering the missing language modality and forming complete vision-language-action triplets. Beyond triplet construction, we also identify vision-language feature collapse as a critical challenge that has been largely overlooked in prior federated VLA research. To mitigate this issue, ForgeVLA combines a client-side contrastive planning loss with a server-side adaptive aggregation strategy to learn task-discriminative representations efficiently. Extensive experiments across multiple benchmarks show that ForgeVLA significantly outperforms other baselines, and ablation studies further validate the contribution of each component.
comment: 26 pages
☆ HBEE: Human Behavioral Entropy Engine -- Pre-Registered Multi-Agent LLM Simulation of Peer-Suspicion-Based Detection Inversion
Insider threat detection assumes that an adaptive insider leaves behavioral residue distinguishing them from legitimate users. We test this assumption against an LLM-driven adaptive insider in a controlled multi-agent simulator. Our pre-registered five-condition study isolates defender mode (cascade vs. blind UEBA) crossed with adversary type (naive vs. adaptive OPSEC) plus a no-mole control, across 100 runs (95 valid after pre-committed exclusions). The primary finding is a detection inversion: at T_60, the adaptive mole's suspicion in-degree is statistically lower than a randomly selected innocent agent (Cliff's delta = -0.694, 95% BCa CI [-0.855, -0.519], Mann-Whitney p << 0.01). The pre-registered prediction was the opposite direction. A pre-registered equivalence test (H2) shows adaptive OPSEC produces no detectable shift in the mole's UEBA rank under either defender mode. The two detection signals (peer suspicion graph in-degree and per-agent UEBA rank) decouple under adaptive adversary behavior. We bound generalization explicitly: a pre-registered Gini calibration check (H4) returns FAIL, with HBEE pairwise message-exposure Gini (0.213) diverging from the SNAP Enron reference (0.730) by |Delta Gini| = 0.52, exceeding the equivalence bound by 5x. The paper makes a narrow but surprising claim: in a controlled environment where adaptive OPSEC is implementable as an LLM directive, peer-suspicion-cascade detection inverts. We release the simulator, pre-registration document, frozen scenarios, raw telemetry, and analysis pipeline under an open-source license.
comment: 14 pages, 6 figures. Pre-registration document and full deviation log included in artifact
☆ Physical Simulators as Do-Operators: Causal Discovery under Latent Confounders for AI-for-Science
Existing interventional causal discovery methods -- IGSP, DCDI, ENCO -- assume causal sufficiency (no latent confounders) and rely on virtual interventions in synthetic simulators. In AI-for-Science settings such as molecular design and materials science, latent confounders are ubiquitous and real interventions (e.g., physics-based simulations) require hours to days per data point. We propose CFM-SD (Causal Flow Matching with Simulation Data), which uses first-principles physical simulators as do-operators in Pearl's interventional calculus to simultaneously handle latent confounders and real interventional data. Theoretically, $d$-variable causal structure is identifiable with $O(d)$ single-variable interventions -- the minimum under physical realizability constraints. In Intrinsic Evaluation on synthetic data ($γ=0.2$--$0.8$), CFM-SD achieves average F1$=0.800$ vs. F1$=0.127$--$0.562$ for all baselines. In Extrinsic Evaluation on real scientific data, CFM-SD achieves 57--58\% bias reduction in molecular toxicity prediction and battery electrolyte optimization, demonstrating practical value beyond synthetic benchmarks.
comment: 17 pages, 1 figure
☆ The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform's first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.
☆ Bounded Fitting for Expressive Description Logics IJCAI
Bounded fitting is an attractive paradigm for learning logical formulas from labeled data examples that offers PAC-style generalization guarantees and can often be implemented leveraging SAT solvers. It has been successfully applied to learning concepts of the description logic ALC. We study bounded fitting for learning concepts in expressive description logics that extend ALC with inverse roles, qualified number restrictions, and feature comparisons. We investigate under which conditions bounded fitting keeps its favorable theoretical properties in this setting, and implement it using a SAT solver. We compare our tool with state-of-the-art concept learners with encouraging results, demonstrating that it is a practical approach to expressive concept learning.
comment: 16 pages, full version of paper accepted at IJCAI-ECAI 2026
☆ Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
☆ Accelerated and data-efficient flow prediction in stirred tanks via physics-informed learning
The simulation of fluid flows is computationally expensive due to the complexity of its governing partial differential equations. Machine learning models offer a potential surrogate, enabling learning from simulations and significantly faster predictions of flow fields. However, these models require large training datasets, which introduces a trade-off between dataset generation cost and predictive accuracy. In this work, we investigate the relationship between the size of the training-set and accuracy of the prediction when learning steady flow fields in an industrial-scale stirred vessel. A data set of steady flows is generated using Reynolds Averaged Navier Stokes (RANS) simulations in a range of realistic operating conditions, including impeller speeds and liquid heights. We train implicit neural representations of flow fields and compare purely data-driven and constrained variants. Model performance is evaluated using global mean squared error (MSE), qualitative spatial comparisons of predicted and reference flow fields, and tracer transport simulations. We find that the prediction error decreases monotonically with increasing training data, but also that it exhibits clear diminishing returns beyond moderate dataset sizes. Physics-based constraints significantly improve accuracy and reduce variability across training runs in low-data regimes, and they lead to more stable tracer-transport behavior. Furthermore, reasonable interpolation can be achieved over different impeller speeds and liquid heights. However, these benefits come with an increase in the complexity of training, and their relative advantage diminishes as the training set grows.
Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study
Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
comment: 9 pages, 5 figures. Accepted at the 1st International Workshop on Prompt Engineering for Software Engineering (PROMPT-SE 2026), co-located with the 30th International Conference on Evaluation and Assessment in Software Engineering (EASE 2026), Glasgow, Scotland, United Kingdom, June 9--12, 2026
☆ OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
Tool-calling text-to-image (T2I) agents can plan and execute multi-step tool chains to accomplish complex generation and editing queries. However, this capability introduces a new safety attack surface: harmful outputs may arise from tool orchestration, where individually benign steps combine into unsafe results, making prompt-only jailbreak techniques insufficient. We present OrchJail, an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Its core idea is to exploit high-risk tool-orchestration patterns: by learning from successful jailbreak tool-calling traces and their causal relationships to prompt wording, OrchJail directly guides the fuzzing search toward prompts that are more likely to trigger unsafe multi-step tool behaviors, rather than relying on surface-level textual perturbations. Extensive experiments demonstrate that OrchJail improves jailbreak effectiveness and efficiency across representative toolcalling T2I agents, achieving higher attack success rates, better image fidelity, and lower query costs, while remaining robust against common jailbreak defenses. Our work highlights tool orchestration as a critical, previously unexplored attack surface and provides a novel framework for uncovering safety risks in T2I agents.
☆ Tracking Large-scale Shared Bikes with Inertial Motion Learning in GNSS Blocked Environments IEEE
Although Global Navigation Satellite Systems (GNSS) provide a general solution for bike tracking outdoors, there still exist complex riding environments where only inertial navigation systems work, such as urban canyons. Despite decades of research, localization using only low-cost inertial sensors still faces challenges such as cumulative drifts and poor robustness caused by filtering methods. Furthermore, sensors such as visual and LiDAR could provide reliable measurements, but they are not suitable for large-scale deployment. In this paper, we propose an inertial tracking framework that integrates bicycle mechanical constraints with a mixture-of-experts model. Specifically, we leverage multiple expert modules to capture shared representations and weight them through the gating mechanism, thus improving multi-task learning performance and enabling uncertainty-aware trajectory estimation. Furthermore, based on the mechanical transmission between the pedal and the rear wheel of a bike, we explore the intrinsic relationship between the rider's periodic pedalling behaviors and acceleration variations, and convert such patterns into bike's wheel speed for dynamic calibration. Experiments with real-world riding data from shared bikes of the DiDi ride-hailing platform demonstrate that our system improves the accuracy of baselines by at least 12%, with wheel speed errors below 0.5 m/s at 95-percentile.
comment: It has been submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). Journal article. 14 pages, 18 figures, 10 tables
☆ Exposing and Mitigating Temporal Attack in Deepfake Video Detection
While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.
☆ Rubric-based On-policy Distillation
On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms the advanced logit-based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
comment: Preprint. Code is available at https://github.com/Peregrine123/ROPD_official
☆ Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling", queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.
comment: 12 pages, 14 tables
☆ BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.
☆ Offline Policy Optimization with Posterior Sampling
A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.
comment: 25 pages, 3 figures
☆ Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
While Vision-Language-Action (VLA) models offer broad general capabilities, deploying them on specific hardware requires real-world adaptation to bridge the embodiment gap. Since robot demonstrations are costly, this adaptation must often occur under a strict data budget. In this work, we identify a critical diversity trap: the standard heuristic of "maximizing coverage" by collecting diverse, single-shot demonstrations can be self-defeating due to non-vanishing estimation noise. We formalize this phenomenon as a Coverage--Density Trade-off. By decomposing the policy error into estimation (density) and extrapolation (coverage) terms, we characterize an interior optimal allocation of unique conditions for a fixed budget. Guided by this analysis, we propose Anchor-Centric Adaptation (ACA), a two-stage framework that first stabilizes a policy skeleton through repeated demonstrations at core anchors, then selectively expands coverage to high-risk boundaries via teacher-forced error mining and constrained residual updates. Real-robot experiments validate our trade-off framework and demonstrate that ACA significantly improves task reliability and success rates over standard diverse sampling strategies under the same budget.
comment: 21 pages, 8 figures
☆ RELO: Reinforcement Learning to Localize for Visual Object Tracking ICML 2026
Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOText without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.
comment: ICML 2026 paper
☆ MORPH-U: Multi-Objective Resilient Motion Planning for V2X-Enabled Autonomous Driving in High-Uncertainty Environments via Simulation
V2X can warn an autonomous vehicle about hazards beyond line-of-sight, but it also brings uncertainty: messages may be delayed, dropped, or even forged. Meanwhile, map knowledge may change during a trip, forcing the vehicle to replan under tight real-time budgets. This paper studies how to make motion planning and low-level control robust to such uncertain, event-driven updates. We present MORPH-U, a CARLA-based closed-loop stack that fuses LiDAR/radar/camera with V2X (CAM/DENM) into a Local Dynamic Map (LDM) and triggers Hybrid-A* replanning when validated hazards or map changes affect the planned route. We expose the planning/control trade-offs via a multi-objective formulation over tracking error, safety margin (minimum TTC), responsiveness, and smoothness, and select operating points using Pareto-frontier analysis. To avoid unsafe replanning from faulty V2X triggers, MORPH-U adds a lightweight Byzantine-inspired acceptance gate that combines a quorum rule with an on-board sensor veto. Experiments in dynamic CARLA scenarios show that V2X-augmented LDM improves downstream safety, Pareto tuning provides controllable accuracy-comfort trade-offs, and the gate prevents replanning under saturated false-DENM injection ($p_{\text{attack}}=1.0$).
☆ MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
comment: https://github.com/MuLabPKU/TransArch
☆ GraphReAct: Reasoning and Acting for Multi-step Graph Inference
Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.
comment: Under review
☆ TTF: Temporal Token Fusion for Efficient Video-Language Model NeurIPS 2026
Video-language models (VLMs) face rapid inference costs as visual token counts scale with video length. For example, 32 frames at $448{\times}448$ resolution already yield >8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring offsets to their gains. We propose \textbf{Temporal Token Fusion (TTF)}, a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame, then for each subsequent frame, performs a local window similarity search (e.g.,$3\times 3$), fusing tokens that exceed a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67\% of visual tokens while retaining 99.5\% of the baseline accuracy and introducing only ${\approx}0.16$\,GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at \href{https://github.com/Cominder/ttf}{https://github.com/Cominder/ttf}
comment: 14 pages; manuscript submitted to NeurIPS 2026
☆ Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
Large reasoning models often reach correct answers through flawed intermediate steps, creating a gap between final accuracy and reasoning reliability. Existing alignment strategies address this with external verifiers or massive sampling, limiting scalability. In this work, we introduce CASPO (Confidence-Aware Step-wise Preference Optimization), a framework that aligns token-level confidence with step-wise logical correctness through iterative Direct Preference Optimization, without training a separate reward model. During inference, we propose Confidence-aware Thought (CaT), which leverages this calibrated confidence to dynamically prune uncertain reasoning branches with negligible O(V) latency. Experiments across ten benchmarks and multiple model families show that CASPO consistently improves reasoning reliability and inference efficiency. CASPO scales to Qwen3-8B-Base and surpasses tree-search baselines on AIME'24 and AIME'25 without using reward-model data. We also release a step-wise dataset with confidence annotations to support fine-grained analysis of reasoning reliability. Code is available at https://github.com/Thecommonirin/CASPO.
comment: 9 pages
☆ Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named `Mage') -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B--30B), 26~hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C\# generation achieves the highest runtime-pass rate (43\% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
comment: Main Content: 10 pages, 1 figure. In total 22 pages
☆ Tools as Continuous Flow for Evolving Agentic Reasoning
Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over long horizons and restricts generalization to unseen tools. To overcome these limitations, we propose Tools as Continuous Flow for Evolving Agentic Reasoning (FlowAgent), which reconceptualizes tool chaining as continuous trajectory generation within a semantic space. To systematically evaluate this paradigm, we introduce the first plan-level closed-loop benchmark dedicated to plan-level agentic reasoning in dynamic real-world environments. Specifically, the proposed FlowAgent leverages conditional flow matching to generate continuous latent trajectories, providing a global planning perspective to ensure coherent and robust tool execution. Theoretically, we establish formal bounds on utility convergence and prove that our continuous formulation fundamentally guarantees robust generalization and error attenuation. Empirical evaluations show that FlowAgent achieves superior robustness and adaptability in long-horizon reasoning tasks.
☆ Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
☆ SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.
comment: Code will be released at https://github.com/scitix/helix
☆ CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations IEEE
Deploying massive large language models (LLMs) as continuous cognitive engines for robotics is bottlenecked by the time-to-first-token (TTFT) latency required to process extensive state histories. Existing solutions like RAG or sliding windows compromise global context or incur prohibitive re-computation costs. We formalize the optimal task structure for minimizing latency and theoretically prove that prefix stability, incremental extensibility, and asynchronous state reconciliation are necessary conditions for real-time performance. Building on these proofs, we introduce the Cached State Representation (CSR) framework as the practical instantiation of these properties, ensuring optimal KV-cache reuse. To sustain these properties over infinite horizons, we further propose an Asynchronous State Reconciliation (ASR) algorithm that offloads state memory eviction to a parallel computational resource to eliminate latency spikes. On a physical robot wirelessly connected to an on-premise GPU server, CSR achieves a 26-fold latency reduction (14.67s to 0.56s) for 120K token contexts with a 235B parameter model compared to a standard baseline. On an embodied AI benchmark, we achieve SOTA recall (0.836 vs. 0.459) while maintaining RAG-level latency. ASR is validated to sustain bounded, spike-free TTFT over 10 eviction cycles in continuous real-world operation. Together, CSR and ASR enable massive LLMs to function as continuously operating, high-frequency (> 2 Hz) embodied policies.
comment: Extended Technical Report for Paper Accepted to IEEE RA-L
☆ Activation Differences Reveal Backdoors: A Comparison of SAE Architectures IJCNN 2026
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
comment: Accepted at IJCNN 2026 (IEEE WCCI). ©2026 IEEE
☆ Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation ICML 2026
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.
comment: Accepted at ICML 2026
☆ Generative Modeling with Flux Matching
We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score-based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high-dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at https://github.com/peterpaohuang/flux_matching.
☆ Amortized-Precision Quantization for Early-Exit Vision Transformers
Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20\% across classification, detection, and segmentation tasks.
☆ Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length--accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose \emph{Implicit Compression Regularization} (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length--accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy--length Pareto frontier.
☆ DCGL: Dual-Channel Graph Learning with Large Language Models for Knowledge-Aware Recommendation SIGIR 2026
Knowledge Graphs (KGs) have proven highly effective for recommendation systems by capturing latent item relationships, while recent integration of Large Language Models (LLMs) has further enhanced semantic understanding and addressed knowledge sparsity issues. Nevertheless, current KG-and-LLM-based methods still face three main limitations: 1) inadequate modeling of implicit semantic relationships beyond explicit KG links; 2) suboptimal single-channel fusion of ID and LLM embeddings, which often leads to signal interference and blurred representations; and 3) insufficient consideration of user-item interaction frequency variations in recommendation strategies. To address these challenges, we propose the Dual-Channel Graph Learning (DCGL) framework, featuring three key innovations: 1) a dual-channel architecture that structurally decouples rich semantic information from user behavioral patterns, preventing early interference; 2) a multi-level contrastive learning mechanism that enhances robustness against KG noise through intra-view contrasts and bridges semantic gaps between channels via inter-view alignment; and 3) a dynamic fusion mechanism that adaptively balances semantic generalization and behavioral specificity based on interaction frequency, resolving the cascading limitation. Extensive experiments on four real-world datasets show that DCGL consistently outperforms state-of-the-art methods, yielding substantial improvements in sparse scenarios while maintaining precision for active users. Our code is available at https://github.com/XinchiZou/DCGL.
comment: Accepted by SIGIR 2026
☆ When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent--memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16--20 percentage points in budget-compliant reliability as irrelevant sessions are added; LiCoMemory's observed failures depend strongly on the agent, with Qwen3-8B exceeding the budget while Qwen3-32B and Qwen3-235B remain reliable in the tested range. The result supports a framework for making scalable-memory claims conditional on agent, interface, scale range, and interaction budget.
comment: 19 pages, 11 figures, preprint
☆ BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.
comment: 16 pages, 7 figures
☆ MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.
☆ SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
Accurately predicting opponents' behavior from interactions is a fundamental capability for large language model (LLM)-based agents in multi-agent and game-theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two-stage opponent modeling framework that distinctly separates opponent model construction and opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph-based formalism for representing dependencies among variables, to capture directed links between opponents' observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi-agent benchmarks demonstrate that SOM consistently outperforms state-of-the-art LLM-based reasoning baselines, enabling more accurate and adaptable strategic decision-making in complex and dynamic multi-agent interactions.
☆ EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams
Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI).In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set.Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains.Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations.Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance.The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.
comment: 8 pages
☆ Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.
☆ Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention
Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose $\textbf{Mask2Cause}$, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
☆ Predictive but Not Plannable: RC-aux for Latent World Models
A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.
☆ Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics
Many scientific and combinatorial problems admit multiple correct solutions, not a single label. Standard supervised learning resolves this ambiguity by choosing one solution as the target, but this hidden selector can be arbitrary, discontinuous, and harder to learn than the underlying solution set. We study bifurcation models, a weight-tied dynamical view in which different initializations can converge to different stable equilibria, so the model represents an attractor landscape rather than one chosen branch. We prove that broad set-valued maps with locally Lipschitz branches can be represented by regular equilibrium dynamics and that the induced selectors are almost everywhere regular, while manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models show that such dynamics can discover multiple valid equilibria without branch labels and outperform single-branch supervision. Allen--Cahn experiments further show that diversity is not automatic: it can be encouraged explicitly, but with an accuracy--diversity tradeoff.
☆ Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
☆ Structured Role-Aware Policy Optimization for Multimodal Reasoning
Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.
comment: 32 pages
☆ From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG
Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71\% to 43.29\%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.
☆ Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.
☆ Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
Over-the-air federated learning (OTA-FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource-element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real-valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow-timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed-form variance under Rayleigh fading. We incorporate REED into full-participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per-client energy budget, the aggregation gain can be scheduled so that the REED-induced perturbation scales quadratically with the local stepsize, yielding the canonical (1/sqrt(T)) stationarity rate. Experiments on MNIST and Fashion-MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.
comment: Preprint; Under-review; Codes to replicate the results is available at: https://github.com/zavareh1/REED
☆ Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
comment: 9 pages, 5 figures
☆ Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment ACL 2026
Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades, surprisingly persisting even when text remains legible. We attribute this to ``Cognitive Overload'', hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple ``Structured Cognitive Offloading'' strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.
comment: Accepted to Findings of ACL 2026
☆ EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
Scalable AI agents training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
☆ Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
comment: 50 pages, 10 figures, 14 tables
☆ MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.
☆ Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5\% of the KV cache and consistently outperforms prior works across all configurations. Notably, our method achieves up to 2$\times$ accuracy loss reduction under extreme compression scenarios compared to existing state-of-the-art baselines with minimal overhead.
☆ CASCADE: Context-Aware Relaxation for Speculative Image Decoding
Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model's high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model's behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model's hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.
☆ HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
Large Language Models have recently emerged as a promising paradigm for automated heuristic design for NP-hard combinatorial optimization problems. Despite this progress, existing LLM-based methods typically rely on monolithic workflows constrained by rigid templates, thereby restricting memory-guided exploration and triggering premature convergence to local optima. To design an autonomous and collaborative architecture, we introduce HMACE, a Heterogeneous Multi-Agent Collaborative Evolution framework that reconceptualizes heuristic search as an organizational design problem. HMACE decomposes each evolutionary generation into an autonomous, role-specialized loop with four coordinated agents: a Proposer for strategy exploration, a Generator for executable heuristic synthesis, an Evaluator for empirical assessment, and a Reflector for archive-backed memory update. By coupling behavior-aware retrieval, lightweight candidate filtering, and fitness-grounded archive updates, HMACE guides the search toward diverse and promising heuristic behaviors while avoiding redundant evaluations. Extensive evaluations on representative COPs, including TSP, Online BPP, MKP, and PFSP, show that HMACE achieves a favorable quality-efficiency trade-off compared to state-of-the-art single-agent and multi-agent baselines. In the matched LLM-driven reference comparison, HMACE achieves the lowest average gaps on TSP and Online BPP (0.464\% and 0.223\%, respectively), while requiring only 0.13M and 0.42M tokens for the two tasks, substantially fewer than the compared baselines.
☆ Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability
Electroencephalography (EEG) is a cornerstone of brain-computer interfaces and clinical neuroscience, yet deep learning models are typically trained and evaluated under a single, unreported preprocessing pipeline. We formalize preprocessing choices as a counterfactual intervention space and show that EEG predictions are surprisingly unstable under this space: across six datasets spanning four paradigms, up to 42% of trial-level predictions flip when only the preprocessing changes, a variability that standard uncertainty methods do not explicitly quantify because they condition on a fixed preprocessing pipeline. We provide three tools to make this instability measurable, decomposable, and reducible. First, a Walsh-Hadamard decomposition of the 2^7 pipeline space reveals that sensitivity is near-additive in practice under the binary intervention design, enabling efficient step-by-step optimization. Second, we introduce Preprocessing Uncertainty (PU), a per-trial diagnostic that captures a dimension of instability complementary to model-based confidence. Third, we study Normalized Adaptive PGI (NA-PGI), a graph-structured regularizer that exploits the compositional structure of preprocessing interventions as one mitigation strategy with clear scope conditions.
☆ HARMONY: Bridging the Personalization-Generalization Gap by Mitigating Representation Skew in Heterogeneous Split Federated Learning
Mobile devices face diverse resource constraints and non-IID data class distributions, requiring fast on-device inference for local in-distribution (ID) classes and on-demand remote support for client-specific out-of-distribution (OOD) classes. Hybrid split federated learning (Hybrid SFL) couples personalized client-side front ends (supporting early exit) with a generalized server-side backend for fallback inference, balancing accuracy and cost. However, under client architectural heterogeneity, the existing hybrid SFL suffers from representation skew, where features from customized extractors fail to align in the shared space, leading to a sharp degradation in the server model responsible for OOD prediction. We propose HARMONY, the first hybrid SFL framework to support heterogeneous client architectures. HARMONY modifies meta-learning to simulate diverse extractors across parameters and architectures, and to learn to personalize. To mitigate representation skew, HARMONY conducts server-side contrastive learning to align extracted features, neither sacrificing clients' personalization nor sharing raw labels. Compared to the state of the art across multiple datasets and model families, HARMONY improves test accuracy by up to 43.0%/28.3% without/with OOD, respectively, while maintaining acceptable latency.
comment: 7 pages (except references), 5 figures
☆ Hallucination Detection via Activations of Open-Weight Proxy Analyzers
We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
comment: 12 pages, 4 figures. Code available at https://github.com/hallu-detect/llm_hallucination_detection
☆ Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
Transforming fragmented enterprise data into actionable insights remains a significant challenge for LLMs, constrained by complex database schemas, limitations in dynamic SQL generation, and the need for deep multi-dimensional analysis.In this paper, we propose AIDA(Autonomous Insight Discovery Agent), the first end-to-end framework designed for autonomous exploration in complex business environments. We establish a highly flexible instant retail environment encompassing 200+ metrics and 100+ dimensions, and integrates a proprietary Domain-Specific Language (DSL) that bridges semantic reasoning with precise SQL execution. Our reinforcement learning system subsequently formulates business analysis as a Pareto Principle-guided cumulative reasoning process. Experimental results demonstrate that AIDA significantly outperforms workflow-based agents, and extensive evaluations further reveal that AIDA achieves superior environmental perception and more in-depth analysis from diverse perspectives. Our work ultimately establishes the transformative potential of autonomous intelligence for industrial-scale business intelligence systems.
☆ PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat ACL 2026
This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5\% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset's annotation patterns and their impact on model generalization, revealing a critical ''validation trap'' phenomenon where high validation performance correlates with poor test transfer.
comment: Accepted to the EEUCA workshop at ACL 2026
☆ Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention
Marketing decisions reflect the interaction of latent consumer heterogeneity, time-varying internal states, and explicit interventions, a structure that current prediction- and language-oriented models do not capture in a unified manner. We propose a Three-in-One world-model architecture in which a Deep Boltzmann Machine (DBM) learns a frozen belief representation from demographics, time, and lagged actions and outcomes, with lightweight task-specific adapters attached on top. The same belief supports three tasks within a single framework: (i) energy-based consistency evaluation through the DBM's free energy, (ii) outcome prediction through adapters, and (iii) counterfactual inference by holding the belief fixed and varying only the action input given to the adapter. Using a controlled simulation in which the latent price sensitivity, promotion responsiveness, and base preference of each consumer are known, we show that the adapters match a strong MLP baseline on visit- and purchase-AUC while recovering heterogeneous treatment effects substantially better than S-, T-, X-, and DR-learner meta-learners and a Causal Forest baseline built on the same raw features, with the largest gap on a confounded price-promotion intervention. Complementing this, free-energy clamps systematically penalize counterfactual purchase trajectories that lack prior promotional exposure, and the penalty itself depends on the latent base preference in the expected direction. These results indicate that DBM beliefs disentangle latent traits in a form that survives counterfactual queries, providing an integrated world-model substrate for marketing intervention.
☆ Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models
Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.
☆ The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
comment: 18 pages, 9 figures
☆ HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
comment: Code & Data: https://github.com/Guankai-Li/HyperEyes
☆ Learning Multi-Relational Graph Representations for DNA Methylation-Based Biological Age Estimation
Aging clocks aim to estimate biological age, a measure of physiological state distinct from chronological age, from observable biomarkers, and are widely used for health assessment and disease analysis. DNA methylation is a particularly informative biomarker due to its stability and strong association with aging, and recent learning-based approaches have improved predictive performance. However, most existing methods treat CpG sites as independent features, overlooking the complex and heterogeneous biological relationships among them. We propose RelAge-GNN, a multi-relational graph neural network framework for DNA methylation-based age prediction. Our method constructs three complementary graphs capturing co-methylation patterns, genomic co-localization, and gene-level associations among CpG sites. Each graph is modeled by an independent GNN branch, and a learnable gating mechanism adaptively fuses the resulting representations. Experiments on large-scale datasets show that RelAge-GNN achieves competitive accuracy and stronger correlation with chronological age compared to state-of-the-art methods. Moreover, the model exhibits improved sensitivity in detecting age acceleration across diverse disease cohorts, highlighting its potential utility for disease characterization. Finally, through post hoc interpretability analyses, we quantify the contributions of different relational structures and CpG sites, providing biologically meaningful insights and suggesting potential directions for aging-related research. Our code is available at: https://anonymous.4open.science/r/RelAge-GNN-F1E3/.
☆ Repeated Deceptive Path Planning against Learnable Observer AAMAS 2026
We study the problem of deceptive path planning (DPP), where an agent aims to conceal its true destination from external observers. While existing work assumes static, non-learning observers, real-world adversaries-such as in critical goods transportation or military operations-can adapt by learning from historical trajectories. To address this gap, we introduce Repeated Deceptive Path Planning (RDPP), a new formulation that explicitly models learnable observers. We show that existing DPP methods fail under this setting, as they cannot adapt to evolving adversarial predictions. While incorporating observer previous predictions into updates enables some adaptation, such incremental updates cause accumulative lag that degrades deception. To this end, we propose Deceptive Meta Planning (DeMP), a two-level optimization framework that combines episode-level adaptation, which enables short-term policy adjustment to counter updated observer, and meta-level updates, which leverage cross-episode feedback to capture how observers update their models and accelerate adaptation in future episodes. In this way, DeMP mitigates the accumulation of adaptation lag, enabling sustained deception against a learning observer. Experiments across environments demonstrate that DeMP significantly outperforms existing approaches in RDPP while maintaining competitive path cost. Our results highlight the importance of modeling repeated interactions with learnable adversaries, providing new insights into deception and privacy in multi-agent systems.
comment: Full version of the extended abstract accepted at AAMAS 2026
☆ SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities varies significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.
♻ ☆ UNCOM: Zero-shot Context-Aware Command Understanding for Tabletop Scenarios
This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information -- speech, gestures, and scene context -- to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without reliance on predefined object models or training data specific to a given task. Using foundational and task-specific deep learning models, it allows out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, enabling integration with symbolic robotic frameworks. We demonstrate the system in a TIAGo++ robot and provide an evaluation on a real-world data set of human-robot interaction scenarios; achieving an 82.39\% success rate over our benchmark data set, highlighting the robustness of the system to diversity, noise, and communication ambiguity. The data set, evaluation scenarios, and the code are publicly available to support future research.
♻ ☆ Goal-Conditioned Decision Transformer for Multi-Goal Offline Reinforcement Learning
Reinforcement learning (RL) in robotics faces significant hurdles regarding sample efficiency and generalization across varying goals. While Offline RL mitigates the need for costly online interactions, its integration with goal-conditioned policies and transformer-based architectures remains underexplored. We introduce a Goal-Conditioned Decision Transformer adapted for offline multi-goal robotics. By explicitly incorporating goal states into the sequence modeling framework, our approach efficiently solves varying tasks using only pre-collected data. We validate this method on a newly released offline dataset for the Franka Emika Panda platform. Experimental results demonstrate that our approach outperforms state-of-the-art online baselines in complex tasks and maintains robustness in sparse-reward settings, even with limited expert demonstrations.
♻ ☆ THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
comment: 17 pages, 13 figures
♻ ☆ Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent's internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.
♻ ☆ AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning dilemma in chunk-wise training: using ground-truth references causes train-inference mismatch, while self-generated references entangle supervision with identity drift. Our asymmetric design circumvents this by anchoring the teacher model with ground-truth continuity references to provide drift-free, chunk-level supervision, thereby avoiding the teacher bottleneck. Meanwhile, the student model learns under inference-aligned conditions, conditioned only on self-generated references, and is trained via distribution matching to preserve identity over long horizons. Extensive experiments show AsymTalker achieves state-of-the-art results on HDTF and VFHQ. It guarantees high-fidelity, identity-consistent synthesis over 600-second videos and reaches a real-time inference speed of 66 FPS.
♻ ☆ LineRides: Line-Guided Reinforcement Learning for Bicycle Robot Stunts IEEE
Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framework that enables a custom bicycle robot to acquire diverse, commandable stunt behaviors from a user-provided spatial guideline and sparse key-orientations, without demonstrations or explicit timing. LineRides handles physically infeasible guidelines using a tracking margin that permits controlled deviation, resolves temporal ambiguity by measuring progress via traveled distance along the guideline, and disambiguates motion details through position- and sequence-based key-orientations. We evaluate LineRides on the Ultra Mobility Vehicle (UMV) and show that the policy trained with our methods supports seamless transitions between normal driving and stunt execution, enabling five distinct stunts on command: MiniHop, LargeHop, ThreePointTurn, Backflip, and DriftTurn.
comment: Published in IEEE Robotics and Automation Letters (RA-L), 2026
♻ ☆ How Do Language Models Compose Functions?
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap", i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. We then decode residual stream representations and identify two processing mechanisms: one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that embedding space geometry is strongly related to which mechanism is employed, where the idiomatic mechanism is dominant when tasks are represented by translations from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions.
♻ ☆ Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.
♻ ☆ Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
♻ ☆ Searching for Privacy Risks in LLM Agents via Simulation ICLR 2026
The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. However, the evolving nature of such dynamic dialogues makes it challenging to anticipate emerging vulnerabilities and design effective defenses. To tackle this problem, we present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. Specifically, we employ LLMs as optimizers to analyze simulation trajectories and iteratively propose new agent instructions. To explore the strategy space more efficiently, we further utilize parallel search with multiple threads and cross-thread propagation. Through this process, we find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery, while defenses evolve from simple rule-based constraints to robust identity-verification state machines. The discovered attacks and defenses generalize across diverse scenarios and backbone models, providing useful insights for developing privacy-aware agents.
comment: ICLR 2026
♻ ☆ TAP: Two-Stage Adaptive Personalization of Multi-Task and Multi-Modal Foundation Models in Federated Learning
In federated learning (FL), local personalization of models has received significant attention, yet personalized fine-tuning of foundation models remains underexplored. In particular, there is a lack of understanding in the literature on how to personalize foundation models in settings where there exist heterogeneity not only in data, but also in tasks and modalities across the clients. To address this gap, we propose Two-Stage Adaptive Personalization (TAP). In the first stage, TAP leverages mismatched model architectures between clients and the server to selectively replace personalized parameters with global updates, explicitly limiting cross-task and cross-modality interference. In the second stage, TAP conducts post-FL distillation on the global model to recover a beneficial shared structure. By reintroducing generalizable knowledge only after the global model has stabilized, TAP enhances generalization without compromising personalization. In developing our methodology, we introduce the first convergence analysis of federated foundation model training at the server under modality-task pair heterogeneity across clients, and demonstrate the impact of the number of modality-task pairs on model fine-tuning. Through extensive experiments, we demonstrate the effectiveness of TAP across a variety of datasets and tasks in comparison to state-of-the-art baselines. The implementation code is publicly available at https://github.com/lee3296/TAP.
comment: 29 pages
♻ ☆ Spectral Filtering for Complex Linear Dynamical Systems
We study the problem of learning complex-valued linear dynamical systems (CLDS) with sector-bounded spectrum. This class captures oscillatory and long-memory dynamics arising in signal processing, structured state space models, and quantum systems. We introduce a spectral filtering method based on the Slepian basis and show that learnability is governed by an effective dimension independent of the ambient state dimension. As a consequence, we obtain dimension-free regret bounds for sequence prediction in CLDS with spectrum contained in a sector of the unit disk.
♻ ☆ Code World Model Preparedness Report
This report documents the preparedness assessment of Code World Model (CWM), a model for code generation and reasoning about code from Meta. We conducted pre-release testing across domains identified in our Frontier AI Framework as potentially presenting catastrophic risks, and also evaluated the model's misaligned propensities. Our assessment found that CWM does not pose additional frontier risks beyond those present in the current AI ecosystem. We therefore release it as an open-weight model.
comment: 25 pages, 3 figures
♻ ☆ Frequency-Aware Model Parameter Explorer: A new attribution method for improving explainability
State-of-the-art attribution methods rely on adversarial sample generation that applies an all-pass filter across the frequency spectrum, discarding fine-grained high-frequency information that is demonstrably important for accurate feature attribution in deep neural networks. By generating adversarial samples that selectively perturb high- and low-frequency components, we can probe which spectral features a model relies on most -- directly translating frequency-domain exploration into attribution signals. Building on this insight, we propose FAMPE (Frequency-Aware Model Parameter Explorer), a novel attribution method that introduces an FFT-based α-weighted perturbation scheme -- separately modulating high- and low-frequency components via an energy-driven spectral cutoff -- and, crucially, integrates this frequency-aware exploration directly into model parameter exploration for attribution, a connection that has not been established in prior work. Unlike prior frequency-aware adversarial approaches that target transferability or imperceptibility, FAMPE's specific formulation is designed and validated exclusively for explainability, translating spectral structure into fine-grained attribution maps without requiring any manual baseline selection. Evaluated on ImageNet across four architectures spanning CNNs and Vision Transformers, at fixed α= 0.1 FAMPE outperforms AttEXplore by 4.25% on Inception-v3 and 12.04% on MaxViT-T, with per-sample oracle selection further revealing that low-frequency-dominated images systematically benefit from high-frequency perturbations -- underscoring the potential of adaptive spectral exploration. Our ablation studies confirm that high-frequency perturbations are disproportionately responsible for attribution precision, while excessive low-frequency noise degrades global structural coherence.
comment: Preprint
♻ ☆ PAIR-Former: Budgeted Relational Multi-Instance Learning for Functional miRNA Target Prediction
Functional miRNA--mRNA targeting is a large-bag prediction problem where each transcript yields a heavy-tailed pool of candidate target sites (CTSs), yet only a pair-level label is observed. Prior methods use max-pooling over individual CTS scores, ignoring relational patterns among sites, but modeling these patterns is critical for accuracy. The challenge is that naive relational aggregation incurs $\mathcal{O}(n^2)$ cost, prohibitive when $n$ reaches thousands, yet a cheap scan alone discards the very interactions that drive functional repression. We formalize this tension as \emph{Budgeted Relational Multi-Instance Learning (BR-MIL)}, a new MIL problem where the compute budget $K$ is a first-class constraint such that at most $K$ instances per bag may receive expensive encoding and relational processing. We establish theoretical foundations for BR-MIL, proving that both approximation quality and generalization are governed by $K$ rather than the raw bag size $n$. Building on this theory, we propose \textbf{PAIR-Former}, which scans all candidates cheaply, selects $K$ diverse CTSs, and aggregates them via Set Transformer. PAIR-Former achieves state-of-the-art performance, outperforming all reproduced baselines with F1$=0.840$ on miRAW (10-fold balanced CV) and $0.839$ on deepTargetPro in transfer evaluation, while achieving $0.793$ on the large-scale MTI benchmark (420K pairs, $38\times$ larger), demonstrating that budgeted relational MIL scales where naive approaches fail. Additional results on CAMELYON16 and Musk2 further show that the proposed BR-MIL formulation extends beyond biological sequence modeling.
comment: Preprint. Under review. During the preprint stage, inquiries and feedback can be directed to Jiaqi Yin (yjqhit@gmail.com)
♻ ☆ End-to-end PDDL Planning with Hardcoded and Dynamic Agents
We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. We support two categories of agents: hardcoded, which are informed by logs and error traces and have a pre-defined goal (e.g., fix issues with PDDL syntax, check temporal constraints), and dynamic, which have no predefined goal but adapt to the specific domain and revise the latent planning abstraction. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework on GPT-\{4o, 5-mini, 5.4\}, and Gemini-\{2.5, 3\}-flash across more than ten domains and tasks, including the Google NaturalPlan benchmark, Planbench, and classic planning problems like Sokoban, Blocksworld and the Tower of Hanoi, where LLMs are known to struggle even with small instances. Our framework can be integrated with any PDDL planning engine and validator (we successfully tested Fast Downward, LPG, POPF, VAL, and uVAL) and represents a significant step toward end-to-end planning aided by LLMs.
comment: Code: https://github.com/EmanueleLM/MultiAgentPlanning
♻ ☆ Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency; aligning to human behavioral models when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions which reveals an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research.
comment: 25 pages, 9 figures, 7 tables
♻ ☆ Harnessing Pre-Resolution Signals for Future Prediction Agents
Many high-stakes decisions depend on forecasts made before outcomes are known. In this future prediction setting, the central challenge is that public evidence evolves over time, while the main supervision signal arrives only after resolution: the realized outcome mainly assesses final correctness, offering only coarse guidance on what to track, what to verify, and which judgments to leave uncertain along the way. Our key observation is that revisiting the same unresolved question over time creates informative temporal contrasts across evolving evidence and repeated forecasts, exposing what earlier attempts missed before resolution and yielding a diagnostic signal we call the pre-resolution signal. We instantiate this idea in Milkyway, a future prediction agent with a persistent future prediction harness, an editable external state that stores reusable procedural guidance across revisits to the same unresolved question. As the same unresolved question is revisited, Milkyway extracts pre-resolution signals from evolving evidence and repeated forecasts, uses them to update the harness, and improves later forecasts on that question before resolution. After resolution, the realized outcome serves as a post-resolution check of provisional updates. On the FutureX and FutureWorld benchmarks, Milkyway achieves strong performance against competitive baselines, and a mechanism study suggests that the gains stem from harness evolution driven by pre-resolution signals rather than repeated prediction alone.
comment: Work in progress
♻ ☆ Rep2Text: Decoding Full Text from a Single LLM Token Representation
Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we investigate a fundamental question: to what extent can the original input text be recovered from a single last-token representation in an LLM? To this end, we propose Rep2Text, a novel framework for decoding text from last-token representations. Rep2Text employs a trainable adapter that maps a target model's last-token representation into the token embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments across various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, etc.) show that, on average, roughly half of the tokens in 16-token sequences can be recovered from this compressed representation while preserving strong semantic coherence. Further analysis reveals a clear information bottleneck effect: as sequence length increases, token-level recovery declines, while semantic information remains relatively well preserved. We also find that scaling effects are less pronounced in inversion tasks. Finally, our framework demonstrates robust generalization to out-of-distribution clinical data.
comment: 18 pages, 6 figures, 6 tables
♻ ☆ Time series causal discovery with variable lags
Causal Bayesian Networks (CBNs) are a powerful tool for reasoning under uncertainty about complex real-world problems. Such problems evolve over time, responding to external shocks as they occur. To support decision-making, CBNs require a cause-and-effect map of the variables under consideration, known as the network's structure. Learning the graphical structure of a causal model from data remains challenging; learning it from time-series data is even harder because dependencies may arise at different time lags. Existing time-series causal discovery methods often assume a fixed lag window and do not explicitly optimise edge-specific lags. We propose a Tabu-based structure learning algorithm that searches for a time-ordered directed structure (i.e., where every edge respects time) while allowing edge-specific lags up to a specified maximum lag. The approach uses a decomposable BIC-based score with node-specific effective sample sizes and an explicit lag-length penalty encouraging parsimonious delay assignments while preserving efficient local score updates. We provide theoretical guarantees of validity and local optimality, and we also describe a parallel implementation for improved scalability. In simulations, the method recovered graph structure competitively and estimated lags accurately when true adjacencies were recovered. On a real-world UK COVID-19 policy dataset, the learnt structure was dominated by short delays while retaining a substantial minority of longer-lag dependencies, consistent with delayed behavioural and epidemiological effects.
♻ ☆ Scalable Option Learning in High-Throughput Environments
Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.
♻ ☆ Emergent social transmission of model-based representations without inference
How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others' beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner's experience, causing its representation to converge toward the expert's. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.
comment: Code available at https://github.com/skessler01/social-transmission-rl.git
♻ ☆ HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs
Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely novel entities (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with novel relation types (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to any knowledge hypergraph, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of varying arities, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.
♻ ☆ Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models~(MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
♻ ☆ Flat Channels to Infinity in Neural Loss Landscapes NeurIPS'25
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_iσ(\mathbf{w_i} \cdot \mathbf{x}) + a_jσ(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow σ(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) σ'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
comment: Accepted to NeurIPS'25 (fixed resolution of equations in figs.1,2,3)
♻ ☆ VDCook:DIY video data cook your MLLMs
We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data `cooking' and indexing\cite{vlogger}. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. \textbf{Project demo:} https://screenapp.io/app/v/WP0SvffgsH
♻ ☆ Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation MICCAI 2026
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.438 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
comment: 10 pages, 5 figures. Submitted to MICCAI 2026
♻ ☆ Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression ICML 2026
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
comment: ICML 2026
♻ ☆ Making AI Evaluation Deployment Relevant Through Context Specification
With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches often mask the operational realities that ultimately determine deployment success, making it difficult for organizational decision makers to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform this decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.
comment: 8 pages; 2 figures
♻ ☆ Shaping the Future of Mathematics in the Age of AI
Artificial intelligence is transforming mathematics at a speed and scale that demand active engagement from the mathematical community. We examine five areas where this transformation is particularly pressing: values, practice, teaching, technology, and ethics. We offer recommendations on safeguarding our intellectual autonomy, rethinking our practice, broadening curricula, building academically oriented infrastructure, and developing shared ethical principles - with the aim of ensuring that the future of mathematics is shaped by the community itself.
comment: To appear in Notices of the American Mathematical Society. Based on discussions at a September 2025 workshop on "Mechanization and Mathematical Research" held at the Lorentz center, Leiden
♻ ☆ AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature-grounded ideation, validated execution, vision-based physics verification, source-code modification, and figure-grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and open-ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam-Agent. At the center of the framework is a vision-language physics-verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall Cf RMSE against DNS by 7.89% on the periodic hill at Reh=5600; under matched LLM cost, two strong general AI-scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain-specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted-failure ablation shows that the vision-language gate detects 14 of 16 silent failures missed by solver-level checks. Code, prompts, and run artifacts are released at https://github.com/csml-rpi/cfd-scientist.
comment: 9 main pages and rest in appendix
♻ ☆ Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization ACL 2026
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.
comment: Accepted to ACL 2026. 23 pages, 29 figures, 6 tables
♻ ☆ MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our homepage is available in miniappbench.github.io.
♻ ☆ Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
Multi-Hop Fact Verification (MHFV) necessitates complex reasoning across disparate evidence, posing significant challenges for Large Language Models (LLMs) which often suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought (CoT), lack explicit modeling of the causal dependencies between evidence and claims. In this work, we introduce a novel framework that grounds reasoning in a Structural Causal Model (SCM), treating verification as a constructive causal inference process. We empirically identify an "inverted U-shaped" correlation between reasoning chain length and accuracy, revealing that excessive structural complexity degrades performance. To address this, we propose a Rule-based Reinforcement Learning strategy using Group Relative Policy Optimization (GRPO). This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework significantly outperforms state-of-the-art baselines, offering a reliable and interpretable solution for complex fact verification.
♻ ☆ On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study \emph{constraint-driven online resource allocation for agentic workflows}. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask--model pair, the executor dynamically allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose \emph{Monte Carlo Portfolio Planning} (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget--deadline constraints.
comment: Preprint
♻ ☆ Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning
In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict-free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi-agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule-based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy-configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.
comment: 8 pages, 3 figure, 1 table
♻ ☆ VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
♻ ☆ From Time Series Analysis to Question Answering: A Survey in the LLM Era IJCAI 2026
Recently, Large Language Models (LLMs) have introduced a novel paradigm in Time Series Analysis (TSA), leveraging strong language capabilities to support tasks such as forecasting and anomaly detection. However, these analysis tasks cannot adequately cover temporal language tasks, such as interpretation and captioning. A fundamental gap remains between TSA and LLMs: LLMs are pre-trained to optimize natural language relevance for question answering rather than objectives specialized for TSA. To bridge this gap, TSA is evolving toward Time Series Question Answering (TSQA), shifting from expert-driven and task-specific analysis to user-driven and task-unified question answering. TSQA depends on flexible exploration rather than predefined TSA pipelines. In this survey, we first propose a taxonomy that reflects the evolution from TSA to TSQA, driven by a shift from external to internal alignment. We then organize existing literature into three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, and provide practical guidance for flexible, economical, and generalizable selection of alignment paradigms. We finally analyze datasets across domains and characteristics, identify challenges, and highlight future research directions.
comment: Accepted by IJCAI 2026 Survey Track
♻ ☆ Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target's attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
comment: 17 pages, 6 figures
♻ ☆ Exact Is Easier: Credit Assignment for Cooperative LLM Agents
Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi-agent LLM systems this evaluation distorts the result it claims to measure. This failure is not isolated: learned critics, trajectory-level baselines, and agent-removal counterfactuals all inherit from standard multi-agent reinforcement learning a premise that exact counterfactual evaluation requires privileged environment access, and therefore approximate. In cooperative LLM systems, this premise is false. Interaction histories are deterministic functions of observable text with no hidden state, so any decision point can be restored exactly, making direct causal measurement possible without parametric approximation. C3 exploits this property by fixing the complete history at each decision point, sampling alternative actions under a frozen behavior policy, and computing unbiased per-decision advantages through a parameter-free leave-one-out baseline. Across six benchmarks spanning math reasoning and code generation, two model families, and two multi-agent topologies, C3 consistently outperforms all baselines; a controlled decomposition confirms gains originate from credit quality, not architecture, while checkpoint restoration reduces training token consumption. The exact solution proves simpler, cheaper, and more effective than all approximate alternatives. The same structural property that enables exact credit also enables exact verification: three independently computable diagnostics, credit fidelity, within-group variance, and inter-agent influence, constitute the first method-agnostic auditing tool for multi-agent LLM credit assignment. Our code is available at https://github.com/EIT-EAST-Lab/C3
♻ ☆ Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern. Existing studies predominantly concentrate on adversarial attacks or inherent flaws within the models. However, a more prevalent yet underexplored issue concerns how the quality of a benign but poorly formulated prompt affects the security of the generated code. To investigate this, we first propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency. Based on this framework, we construct and publicly release CWE-BENCH-PYTHON, a large-scale benchmark dataset containing tasks with prompts categorized into four distinct levels of normativity (L0-L3). Extensive experiments on multiple state-of-the-art LLMs reveal a clear correlation: as prompt normativity decreases, the likelihood of generating insecure code consistently and markedly increases. Furthermore, we demonstrate that advanced prompting techniques, such as Chain-of-Thought and Self-Correction, effectively mitigate the security risks introduced by low-quality prompts, substantially improving code safety. Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
comment: Accepted for publication in Empirical Software Engineering (EMSE) Journal
♻ ☆ Econometric vs. Causal Structure-Learning for Time-Series Policy Decisions: Evidence from the UK COVID-19 Policies
Causal machine learning (ML) recovers graphical structures that inform us about potential cause-and-effect relationships. Most progress has focused on cross-sectional data with no explicit time order, whereas recovering causal structures from time series data remains the subject of ongoing research in causal ML. In addition to traditional causal ML, this study assesses econometric methods that some argue can recover causal structures from time series data. The use of these methods can be explained by the significant attention the field of econometrics has given to causality, and specifically to time series, over the years. This presents the possibility of comparing the causal discovery performance between econometric and traditional causal ML algorithms. We seek to understand if there are lessons to be incorporated into causal ML from econometrics, and provide code to translate the results of these econometric methods to the most widely used Bayesian Network R library, bnlearn. We investigate the benefits and challenges that these algorithms present in supporting policy decision-making, using the real-world case of COVID-19 in the UK as an example. Four econometric methods are evaluated in terms of graphical structure, model dimensionality, and their ability to recover causal effects, and these results are compared with those of eleven causal ML algorithms. Amongst our main results, we see that econometric methods provide clear rules for temporal structures, whereas causal-ML algorithms offer broader discovery by exploring a larger space of graph structures that tends to lead to denser graphs that capture more identifiable causal relationships.
♻ ☆ Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
comment: 28 pages, 13 figures
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty. DGPO replaces the volatile KL divergence with the bounded Hellinger distance to safely quantify token level exploration without the risk of gradient explosion. To effectively distinguish genuine reasoning breakthroughs from hallucinatory noise, we propose an entropy gating mechanism that scales this deviation by the policy`s epistemic uncertainty. By dynamically redistributing the coarse sequence-level advantage to individual tokens based on these gated scores, DGPO heavily incentivizes critical exploratory steps while suppressing unwarranted, low-entropy deviations. Consequently, DGPO completely eliminates the traditional token-level KL penalty and achieves fine-grained credit reallocation without the computational overhead of an additional value network. Extensive empirical evaluations demonstrate that DGPO sets a new state-of-the-art for critic free alignment. Notably, on the Qwen2.5-32B architecture, DGPO achieves 60.0% Avg@32 accuracy and 46.0% Avg@32 accuracy on the challenging AIME2024 and AIME2025 benchmarks respectively, substantially outperforming competitive baselines like DAPO.
♻ ☆ Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^π$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
♻ ☆ VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token level supervision. To stably integrate dense supervision with RL, we introduce a direction magnitude decoupling mechanism, where rollout level advantages computed from rewards determine update direction, while structured privileged signals modulate token level update magnitudes. This design enables semantically aligned and fine grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs.
♻ ☆ Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular datasets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled, efficient, and tuning-free sampling procedure to construct Bayesian posteriors for such estimates based on martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the efficiency and calibration of our method in inference applications.
♻ ☆ Discovering Learning-Friendly Generation Orders for Sequential Computation
Sequential computation via autoregressive generation can make difficult tasks learnable, but the generation order of intermediate states strongly affects whether training succeeds. We address the problem of discovering a learning-friendly target order automatically, rather than relying on task-specific design. Our key observation is that learning-friendly orders cause a faster loss drop in the early stage of training. We exploit this by \emph{loss profiling}, which ranks candidate orders by the early-stage loss of a single short run. To handle the factorial candidate space, we wrap loss profiling in a hierarchical global -- local search over block- and within-block-level orderings. On six order-sensitive tasks, the method discovers effective orders up to $L=13$ from random initialization and up to $L=40$ from structured initialization, lifting success rates from about 10\% to near 100\%. On integer multiplication, it rediscovers the reverse-digit order that was reported to be efficient in prior studies. On delay dynamical systems, as a case study of multi-variate recurrences, learnability varies sharply even among valid topological sorts of the dependency graph: loss profiling identifies a learning-friendly one, and the global search even discovers orders surpassing hand-designed candidates.
comment: 10+24 pages, 10 figures
♻ ☆ PerfCoder: Large Language Models for Interpretable Code Performance Optimization
Large language models (LLMs) have achieved remarkable progress in automatic code generation, yet their ability to produce high-performance code remains limited--a critical requirement in real-world software systems. We argue that current LLMs struggle not only due to data scarcity but, more importantly, because they lack supervision that guides interpretable and effective performance improvements. In this work, we introduce PerfCoder, a family of LLMs specifically designed to generate performance-enhanced code from source code via interpretable, customized optimizations. PerfCoder is fine-tuned on a curated collection of real-world optimization trajectories with human-readable annotations, and preference-aligned by reinforcement fine-tuning using runtime measurements, enabling it to propose input-specific improvement strategies and apply them directly without relying on iterative refinement. On the PIE code performance benchmark, PerfCoder surpasses all existing models in both runtime speedup and effective optimization rate, demonstrating that performance optimization cannot be achieved by scale alone but requires optimization stratetgy awareness. In addition, PerfCoder can generate interpretable feedback about the source code, which, when provided as input to a larger LLM in a planner-and-optimizer cooperative workflow, can further improve outcomes. Specifically, we elevate the performance of 32B models and GPT-5 to new levels on code optimization, substantially surpassing their original performance.
♻ ☆ How Log-Barrier Helps Exploration in Policy Optimization
Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
♻ ☆ A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback
We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(α,β,δ,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly worst-case) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(δ^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.
♻ ☆ The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs
This paper investigates the relationship between convolutional neural network (CNN) topology and image recognition performance through a comparative study of the VGG, ResNet, and GoogLeNet architectural families. Utilizing a unified experimental framework, the study isolates the impact of depth from confounding implementation variables. A formal distinction is introduced between nominal depth ($D_{\mathrm{nom}}$), representing the physical layer count, and effective depth ($D_{\mathrm{eff}}$), an operational metric quantifying the expected number of sequential transformations. Empirical results demonstrate that architectures utilizing identity shortcuts or branching modules maintain optimization stability by decoupling $D_{\mathrm{eff}}$ from $D_{\mathrm{nom}}$. These findings suggest that effective depth serves as a superior framework for predicting scaling potential and practical trainability, ultimately indicating that architectural topology - rather than sheer layer volume - is the primary determinant of gradient health in deep learning models.
♻ ☆ Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response
Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with an LLM-based reasoning layer. Utilising a zero-leakage forensic pipeline on the Sanger GDSC dataset (N = 83), we achieve a robust predictive correlation (r = 0.447, p = 2.30e-05). We identify a Symbolic Scaffold effect, where the explicit modelling of clinical context (MSI status) provides a 3.6 percent gain in fidelity in data-sparse regimes. Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape, identifying a hierarchical dominance of the APC/Wnt-axis over the p53 apoptotic pathway. Validated against human clinical profiles (TCGA-COAD proxy, p = 0.0357), our framework provides a transparent, invertible, and biologically grounded path towards explainable AI in oncology.
♻ ☆ Muon Dynamics as a Spectral Wasserstein Flow
Gradient normalization stabilizes deep-learning optimization, and spectral normalizations are especially natural for matrix-shaped parameter blocks; Muon is the motivating example. We study an idealized deterministic, continuous-time, vanishing-momentum version of this idea in the mean-field regime, where wide models are represented by probability measures on parameter space. Starting from normalized matrix flows, we introduce Spectral Wasserstein distances indexed by norms $γ$ on positive semidefinite matrices: the trace norm gives classical $W_2$, the operator norm gives the Muon geometry, and Schatten norms interpolate between them. We develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and for monotone norms, prove equivalence with a Benamou--Brenier formulation. This yields a gradient-flow interpretation of the mean-field normalized training dynamics. We illustrate these findings by numerical experiments on MMD flows, Gaussian reductions, two-layer ReLU models, and shallow attention.
♻ ☆ Supervised sparse auto-encoders for interpretable and compositional representations
Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models-a mathematical framework from neural collapse theory-and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.
♻ ☆ LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
comment: 8 pages, 7 figures, conference
♻ ☆ DT-PBO: an Interpretable Tree-based Surrogate Model for Preferential Bayesian Optimization
Preferential Bayesian Optimization (PBO) aims to find a decision-maker's most preferred solution in as few pairwise comparisons as possible. Existing approaches rely on Gaussian Process (GP) surrogates, which provide strong performance but limited interpretability. This limits real-world usability in high-stakes domains, such as healthcare, where interpretability and trust are essential. We propose DT-PBO, a novel tree-based surrogate model for PBO that is inherently interpretable while capturing preference uncertainty. Specifically, we introduce a novel splitting heuristic that constructs interpretable shallow decision trees directly from pairwise comparison data, and use Laplace approximation to obtain probabilistic estimates for each leaf. This enables efficient preference modeling without sacrificing interpretability. Across eight benchmark functions, our method achieves competitive convergence to GP-based PBO, particularly on functions with rugged optimization landscapes. Additional experiments show robustness against noise and a fast computational running time. Experiments on real-world datasets further demonstrate that our model provides interpretable insights into decision-maker preferences that would remain opaque under GP-based approaches.
♻ ☆ ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning
Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present \textsc{ReasonSTL}, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. \textsc{ReasonSTL} decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with \textsc{STL-Bench}, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with \textsc{ReasonSTL} achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that \textsc{ReasonSTL} provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.
♻ ☆ VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an explicit variance bound for the deployed kernel. Experiments on math reasoning and code generation show that VESPO maintains stable training under severe off-policy conditions (staleness up to 64x) and delivers consistent gains across both dense and Mixture-of-Experts (MoE) models, outperforming recent reshaping baselines under matched setup. Code is available at https://github.com/FloyedShen/VESPO.
♻ ☆ A Geometric Taxonomy of Hallucinations in LLMs
Hallucinations in deployed language models can have real consequences for downstream decisions in domains such as healthcare, legal, and financial services. In production, detection has to run on what the deployed system can see: the query, the response, and often a source document. White-box access to model internals and multi-sample querying are not generally available behind a third-party API. Within this setting - black-box, single-pass, only question/answer available - the dominant baseline is NLI, which returns a value but no diagnosis when it fails. We argue that operating directly on the geometry of the embedding space provides detection methods whose successes and failures are interpretable as structural properties of contrastive sentence-encoder training \citep{wang2020understanding}. The contribution is: given an operationally-motivated taxonomy, geometry predicts which types of hallucination are detectable and which are not - and the predictions hold. We propose three operational types organized by the relation of the response embedding to the plausibility region of grounded responses on the unit hypersphere, and derive from the alignment objective a prediction for each: (1)query-proximate unfaithfulness is detectable by an angular ratio; (2)confabulation outside the plausibility region produces a directional signature that outperforms NLI on expert-annotated error; (3)factual errors sharing vocabulary and frame with correct answers are not separable by angular geometry. To validate on content resembling deployment, we built a 212-pair human-confabulated dataset across nine domains using provoked confabulation.
♻ ☆ Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. Extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
♻ ☆ SpikingBrain: Spiking Brain-inspired Large Models
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
♻ ☆ AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. By organizing information according to the structure of program, this structure provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg has achieved the state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.
comment: 16 pages, 8 figures
♻ ☆ ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices
Multimodal large language models (MLLMs) have made significant progress in mobile agent development, yet their capabilities are predominantly confined to a reactive paradigm, where they merely execute explicit user commands. The emerging paradigm of proactive intelligence, where agents autonomously anticipate needs and initiate actions, represents the next frontier for mobile agents. However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce ProactiveMobile, a comprehensive benchmark designed to systematically advance research in this domain. ProactiveMobile formalizes the proactive task as inferring latent user intent across four dimensions of on-device contextual signals and generating an executable function sequence from a comprehensive function pool of 63 APIs. The benchmark features over 3,660 instances of 14 scenarios that embrace real-world complexity through multi-answer annotations. To ensure quality, a team of 30 experts conducts a final audit of the benchmark, verifying factual accuracy, logical consistency, and action feasibility, and correcting any non-compliant entries. Extensive experiments demonstrate that our fine-tuned Qwen2.5-VL-7B-Instruct achieves a success rate of 19.15%, outperforming o1 (15.71%) and GPT-5 (7.39%). This result indicates that proactivity is a critical competency widely lacking in current MLLMs, yet it is learnable, emphasizing the importance of the proposed benchmark for proactivity evaluation.
♻ ☆ EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle ICML 2026
Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.
comment: Accepted by ICML 2026
♻ ☆ Structural Generalization on SLOG without Hand-Written Rules
Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves an overall accuracy of $67.3 \pm 0.2\%$ across 10 seeds (AM-Parser: $70.8 \pm 4.3\%$), with 11 of 17 structural generalization categories at $100\%$ type-exact match, including three where AM-Parser scores $0$--$74\%$. Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., $41.4\%$) are mixtures of structurally distinct CCG patterns, not partial generalization. These results suggest that CCG directed types provide higher resolution than SLOG's phenomenon-level categories for characterizing structural generalization, and that the success/failure boundary is determined by the coverage of directed operations in the training data.
♻ ☆ ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2--Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18 000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content, each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.
♻ ☆ Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} that restricts off-manifold exploration. Existing workarounds, such as appending \textit{residual} Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose \textit{Maximum Entropy Adjoint Matching} (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a \textit{Mixture Behavior Prior} that broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
♻ ☆ MinMax Recurrent Neural Cascades
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
♻ ☆ Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreak arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
comment: 14 pages
♻ ☆ Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
comment: 35 pages, 30 figures, 8 tables
♻ ☆ Towards an Inferentialist Account of Information Through Proof-theoretic Semantics
Information is one of the most widely-discussed concepts of the current era. However, a great deal of insightful work notwithstanding, it is yet to be given wholly convincing logical or mathematical foundations. Without them, we lack adequate reasoning tools for understanding the complex ecosystems of systems upon which the society depends. We seek to rectify this by taking a first step towards developing an inferentialist semantic theory of information. There are three key interacting components. First, conceptual analysis: the metaphysics of information. Dretske expressed the key concepts of information in terms of intentionality, truth, and transmissibility. We replace truth with inferability, and trace the consequences of this replacement. Second, logic: proof-theoretic semantics (P-tS) provides a mathematical-logical realization of inferentialist reasoning. Using P-tS, we develop the first steps towards a mathematical-logical theory of an inferentialist primitive unit of information, the 'inferon'. This proof-theoretic approach counterpoints the model-theoretic view of information articulated in situation theory. Furthermore, we argue that it facilitates addressing all three components of van Benthem and Martinez's categorization of the understandings of information, as range, as correlation, and as code. Our focus is on information-as-correlation. Third, systems: the P-tS tools we develop provide the basis for a mathematical account of distributed systems modelling -- a key tool from informatics for understanding the organization of information processing systems. This yields a reasoning-based theory of information flow in models of distributed systems. Overall, we seek to give a conceptually rigorous mathematical-logical account of information and its role within informatics, grounded in inference and reasoning.
comment: Manuscript
♻ ☆ Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.
♻ ☆ Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies
This work studies the application of Multi-Agent Reinforcement Learning (MARL) to decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for MARL scaling studies. A robust baseline policy is proposed which restricts agent motion and applies Dijkstra's shortest path algorithm. Computational experiment results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but face scalability issues as the number of agents increases. Source code and animations are available online at https://github.com/mikapersson/Information-Relaying.
comment: Accepted to the 2026 IFAC World Congress
♻ ☆ CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
♻ ☆ Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
We propose MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL-Rad addresses the limitation of post-hoc agentization, where fixed LLMs are organized into hand-designed agentic workflows without being optimized for their assigned roles. Our framework decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, and jointly optimizes them using clinically verifiable rewards. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art clinical efficacy performance. Further analyses show that MARL-Rad improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that MARL-Rad produces reports clinically comparable to ground-truth reports.
comment: 23 pages, 4 figures
♻ ☆ MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning ACL 2026
LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
comment: Accepted to ACL 2026
♻ ☆ Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
♻ ☆ Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
comment: https://github.com/Kwen-Chen/Flexible-Entropy-Control
♻ ☆ How Hard Is It for Message-Passing GNNs to Simulate One Weisfeiler-Lehman Color-Refinement Step?
Message-passing graph neural networks (MPGNNs) are commonly compared with the Weisfeiler-Lehman (WL) color-refinement procedure, but this comparison does not quantify the resource parameters a network needs to realize color refinement with bounded-size messages and finite numerical precision. We study the cost of simulating a single color-refinement step on unattributed graphs. We distinguish input-independent, or oblivious, simulation from instance-dependent simulation. In the former, the parameters, or their distributions in randomized models, are fixed before the input instance is known. Our results show that the local form of WL color refinement hides a global relabeling problem. In the oblivious setting, deterministic and zero-error randomized MPGNNs cannot solve this problem in the worst case using only shallow networks with small messages. We complement this lower bound with a nearly matching construction in a stronger rooted, port-aware model. By contrast, when the color set is large, bounded-error randomness can greatly reduce the cost, and a one-layer MPGNN with messages of logarithmic size and a logarithmic number of random bits suffices. We show that this logarithmic number of random bits is essentially necessary for shallow, small-message simulations. When the color set is small, we still obtain a rooted, port-aware simulation, but this construction requires more layers or larger messages. We also prove that this extra cost is partly unavoidable, as small color sets force a nontrivial trade-off between the number of layers and the message size. Finally, instance-dependent simulation can be much shallower, but the required instance-specific parameters are not necessarily easy to find. Together, these results reveal quantitative structure hidden behind the statement that MPGNNs match WL color refinement.
♻ ☆ FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. It can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. Specifically, we modify and extend verl-tool, resulting in a new framework that we call verl-tool-future. Unlike standard reinforcement learning training frameworks that rely on immediate rewards, verl-tool-future stores prediction-time rollouts, backfills rewards after real-world outcomes become available, and then replays the completed trajectories for policy update. Across three open-source agents, successive FutureWorld training rounds lead to consistent improvements in prediction accuracy, probabilistic scoring, and calibration, demonstrating that delayed real-world outcome feedback can serve as an effective reinforcement learning signal.
comment: We will release the code in the near future
♻ ☆ SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution
Large Language Model (LLM)-guided evolutionary search is increasingly used for automated algorithm discovery, yet most current methods track search progress primarily through executable programs and scalar fitness. Even when natural-language reasoning is used through heuristic descriptions or reflection, it typically remains transient mutation context or unstructured memory, rather than organized as persistent population-level state over strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce \model, a modular strategy-space layer that turns language-level strategic reasoning into first-class population-level evolutionary state in LLM-driven program search. \model represents each candidate program with an explicit natural-language strategy, clusters the archive by strategy semantics, retrieves behaviorally complementary inspirations, and periodically navigates the strategy landscape to avoid saturated directions. Without modifying the underlying evolutionary algorithms, \model improves existing evolutionary backbones across algorithm discovery, systems optimization, and agent-scaffold design tasks in most settings. Across four systems benchmarks, \model achieves a 20.6% average relative improvement, with the best single run on Prism scoring 3$\times$ higher. These results suggest that persistent strategy representations provide a practical mechanism for improving the effectiveness and cost-efficiency of LLM-guided evolutionary search, pointing toward compound AI systems whose search capabilities benefit from the structured accumulation and reuse of algorithmic strategies.
♻ ☆ Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
♻ ☆ Mixture of Masters: Sparse Chess Language Models with Player Routing
Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically, e.g., Tal's offensive vocation or Petrosian's defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
♻ ☆ WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning ACL 2026
Deep Research systems based on web agents have shown strong potential in solving complex information-seeking tasks, yet their search efficiency remains underexplored. We observe that many state-of-the-art open-source web agents rely on long tool-call trajectories with cyclic reasoning loops and exploration of unproductive branches. To address this, we propose WebClipper, a framework that compresses web agent trajectories via graph-based pruning. Concretely, we model the agent's search process as a state graph and cast trajectory optimization as a minimum-necessary Directed Acyclic Graph (DAG) mining problem, yielding pruned trajectories that preserve essential reasoning while eliminating redundant steps. Continued training on these refined trajectories enables the agent to evolve toward more efficient search patterns and reduces tool-call rounds by about 20% while improving accuracy. Furthermore, we introduce a new metric called F-AE Score to measure the model's overall performance in balancing accuracy and efficiency. Experiments demonstrate that WebClipper compresses tool-call rounds under excellent performance, providing practical insight into balancing effectiveness and efficiency in web agent design.
comment: ACL 2026 Main
♻ ☆ Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication-structure optimization, we propose an ensemble-based information-theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative-free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black-box and noisy multi-agent systems. To enhance scalability, we construct a compact candidate pool through embedding-based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication-structure optimization under constrained computational budgets.
♻ ☆ EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning
Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, however, produce uncalibrated point estimates without quantifying predictive uncertainty, exposing decision-making to the risk of overconfident, untrustworthy estimates. To establish a reliable and trustworthy estimation paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. To ensure the integrity of the extracted behavioral evidence and prevent artificial confidence inflation during multimodal fusion, EviDep introduces two tailored mechanisms. First, addressing the temporal-frequency heterogeneity of behavioral cues, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts, effectively filtering out task-irrelevant artifacts. Second, a Disentangled Evidential Learning strategy enforces explicit decorrelation of features in these purified representations. By separating the cross-modal shared consensus from modality-specific behavioral nuances before Bayesian fusion, this rigorous disentanglement strictly prevents the model from double-counting overlapping information. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, thereby delivering a trustworthy, risk-aware decision-support tool for depression estimation.
♻ ☆ Pretraining a Foundation Model for Small-Molecule Natural Products
Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Nowadays, existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well-suited for the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy that is especially tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutional information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification with synthesized molecule-focused baselines to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, by diving into a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Eventually, our method is experimented with virtual screening, illustrating informative natural product representations that can lead to more effective identification of potential drug candidates.
comment: Accepted by Nature Machine Intelligence(2026)
♻ ☆ SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints
In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that SB-TRPO consistently achieves the best balance of safety and task performance in the hard-constrained regime.
♻ ☆ Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
♻ ☆ A Hybrid Graph Neural Network for Enhanced EEG-Based Depression Detection
Graph neural networks (GNNs) are becoming increasingly popular for EEG-based depression detection. However, previous GNN-based methods fail to sufficiently consider the characteristics of depression, thus limiting their performance. Firstly, studies in neuroscience indicate that depression patients exhibit both common and individualized brain abnormal patterns. Previous GNN-based approaches typically focus either on fixed graph connections to capture common abnormal brain patterns or on adaptive connections to capture individualized patterns, which is inadequate for depression detection. Secondly, brain network exhibits a hierarchical structure, which includes the arrangement from channel-level graph to region-level graph. This hierarchical structure varies among individuals and contains significant information relevant to detecting depression. Nonetheless, previous GNN-based methods overlook these individualized hierarchical information. To address these issues, we propose a Hybrid GNN (HGNN) that merges a Common Graph Neural Network (CGNN) branch utilizing fixed connection and an Individualized Graph Neural Network (IGNN) branch employing adaptive connections. The two branches capture common and individualized depression patterns respectively, complementing each other. Furthermore, we enhance the IGNN branch with a Graph Pooling and Unpooling Module (GPUM) to extract individualized hierarchical information. Extensive experiments on two public datasets show that our model achieves state-of-the-art performance.
♻ ☆ Zero-Shot Quantization via Weight-Space Arithmetic
We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
♻ ☆ Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.
♻ ☆ Detecting Distillation Data from Reasoning Models
Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.
♻ ☆ AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4\% gain when integrated into a state-of-the-art software-engineering RL training framework.
comment: 30 pages
♻ ☆ Latent-Space Causal Discovery from Indirect Neuroimaging Observations
Neuroimaging does not observe causal variables directly: hemodynamics and volume conduction distort signals so that statistical dependence need not reflect latent neural influence. Before estimating graphs, one must specify under what assumptions delayed directed structure can be studied from such indirect observations. We formalize a conditional setting - recoverable inversion under modality physics together with nonstationary latent dynamics - and derive an inversion-error propagation bound under explicit assumptions. Building on this framing, we propose INCAMA (INdirect CAusal MAmba): physics-aware inversion coupled with a delay-aware Mamba encoder that uses mechanism shifts as informative variation for directed graph scoring. We use controlled simulations for quantitative validation and HCP motor-task fMRI as a zero-shot external transfer check based on anatomical and task-network consistency. Across TVB simulations, INCAMA improves directed-structure recovery by 2-3x in F1 over observation-space and two-stage baselines, and on HCP motor-task fMRI it produces sparse directed estimates concentrated in canonical visuo-motor pathways.
comment: 9 pages, 2 figures
♻ ☆ Temporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing
Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing KT methods neglect this issue, training on observed logs using standard empirical risk, which yields biased mastery estimates and accumulates errors in subsequent recommendations. To address this, we introduce a doubly robust (DR) formulation for KT that integrates a propensity model with an error imputation model, theoretically guaranteeing unbiasedness if either model is accurate. Beyond unbiasedness, in the sequential setting of KT, we identify that the estimator's performance is compromised by variance-dependent stochastic deviations that accumulate over time, thereby causing training instability and limiting performance. To mitigate this, we derive a generalization bound that explicitly characterizes the impact of estimator variance and identifies temporal smoothness as a key factor in controlling it. Building on these theoretical insights, we propose the Temporal Smoothness Doubly Robust (TSDR) framework. TSDR jointly optimizes the KT predictor and the imputation model with a smoothness regularizer, effectively reducing variance while preserving the unbiasedness guarantee of DR. Experiments on multiple real-world benchmarks demonstrate that TSDR consistently enhances various state-of-the-art KT backbones, underscoring the vital role of principled bias correction in KT.
♻ ☆ MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches predominantly rely on model-centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain-of-thought) are often fragmented and lack a unified framework. In this paper, we propose MAS-Algorithm, a systematic multi-agent workflow for algorithmic problem solving inspired by the practices of competitive programmers and algorithm engineers. Our framework decomposes the end-to-end solving process into modular stages, enabling structured reasoning, tool integration, and flexible coordination among agents. The design emphasizes both rigor and extensibility, allowing it to generalize across diverse problem types. Experimental results on a self-constructed benchmark demonstrate consistent improvements across multiple Qwen series models, achieving an average gain of 6.48% in acceptance rate. In contrast, parameter-efficient fine-tuning on the same data yields only a marginal improvement of 0.89%. We further observe a 4.72% gain on LiveCodeBench-Pro, along with consistent improvements across additional accuracy and efficiency metrics. Beyond performance gains, we conduct comprehensive analyses to better understand the reasoning process within the workflow, including error patterns and cross-scenario behaviors. We further perform customized replacement and ablation studies to explore the upper bound of the framework, showing that individual agents can contribute improvements of up to 27.7%. These results highlight the strong potential of MAS-Algorithm for advancing AI-driven algorithmic reasoning.
♻ ☆ Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing
Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging "Thinking with Images" paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious "cropping-implies-degradation" biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.
♻ ☆ A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents
In the current field of agent memory, extensive explorations have been conducted in the area of memory retrieval, yet few studies have focused on exploring the memory content. Most research simply stores summarized versions of historical dialogues, as exemplified by methods like A-MEM and MemoryBank. However, when humans form long-term memories, the process involves multi-dimensional and multi-component generation, rather than merely creating simple summaries. The low-quality memory content generated by existing methods can adversely affect recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS will match the most relevant retrieval memory units based on the user's query. Then, the corresponding contextual memory units is obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. We conducted experiments on the LoCoMo dataset and further performed ablation experiments, experiments on the robustness regarding the number of input memories, and overhead experiments, which demonstrated the effectiveness and practical value of our method.
comment: The content has been significantly revised and the author has also changed. Therefore, the paper will be withdrawn for revision and then uploaded after the completion of the modifications
♻ ☆ R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes
Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and leads to instability or degraded performance. While some prior works have applied regularization to relax the nonsingularity assumption, their theoretical guarantees inevitably rely on other restrictive conditions. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error minimization. This formulation naturally yields a regularized GTD algorithms, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We conduct a geometric analysis to establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
comment: 32 pages, 8 figures
♻ ☆ Automatic Image-Level Morphological Trait Annotation for Organismal Images ICLR 2026
Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
comment: ICLR 2026
♻ ☆ Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models
Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to \uline{M}ine \uline{in}trinsic mast\uline{er}y (Miner), that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among the other four algorithms, yielding up to \textbf{4.58} absolute gains in Pass@1 and \textbf{6.66} gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models. Code is available at https://github.com/pixas/Miner.
comment: 24 pages
♻ ☆ EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation
Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
comment: added ack and github link
♻ ☆ Continually Evolving Skill Knowledge in Vision Language Action Model
Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework without increasing network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1 % data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/
♻ ☆ Saliency-Aware Regularized Quantization Calibration for Large Language Models
Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, typically optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives based solely on empirical reconstruction error over limited or unrepresentative calibration data may move the quantized weights away from the original floating-point weights, potentially degrading downstream performance. To address this issue, we propose \emph{Regularized Quantization Calibration} (RQC), a unified framework that augments standard PTQ objectives with a regularizer that explicitly controls weight deviation from the original weights. We further generalize this framework to incorporate a saliency-aware regularizer, resulting in \emph{Saliency-Aware Regularized Quantization Calibration} (SARQC). The proposed regularization encourages quantized weights to remain close to the original weights during calibration, leading to improved generalization at inference time. SARQC integrates seamlessly into existing PTQ pipelines and enhances both scale-search-based and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without introducing additional inference overhead.
♻ ☆ CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
Large language models (LLMs) can acquire new capabilities through fine-tuning, but continual adaptation often leads to catastrophic forgetting. We propose CRAFT, a continual learning framework that avoids updating model weights by instead learning low-rank interventions on hidden representations. CRAFT proceeds in three stages: it first routes each task to a group of similar tasks based on output-distribution divergence; it then fine-tunes the model using a Kullback-Leibler (KL) divergence against the group's prior state, which directly controls forgetting and determines convergence; finally, it merges interventions for the updated task into the shared representation using the same KL signal. This design unifies routing, regularization, and merging through a single KL-based objective. CRAFT improves overall performance and reduces forgetting compared to strong LoRA-based approaches across multiple benchmarks and model scales, while remaining robust to task ordering. These results suggest that controlling adaptation in representation space, guided by output-space divergence, provides a scalable and principled approach to continual learning in LLMs.
comment: 24 pages
♻ ☆ Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whetmeher the new version may be activated safely, under what deployment conditions, with what monitoring, and when it should be rolled back. Existing software-deployment patterns (canary, blue-green, feature flags, MLOps pipelines) address parts of this loop but were designed for stateless web services rather than stateful, policy-constrained runtimes that drive AI components in the field. We study this problem in the setting of embodied agents, where capabilities are packaged as installable modules under runtime policy and recovery constraints. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework that treats every new capability version as a governed deployment candidate rather than an immediate replacement. The framework introduces four compatibility checks (interface, policy, behavioral, recovery) and organizes them into a staged pipeline of candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, and rollback. A reference prototype on a PyBullet/ROS 2 testbed evaluated over 6 upgrade rounds with 15 random seeds shows naive upgrade reaches 72.9% task success but drives unsafe activation to 60% by the final round, while governed upgrade retains comparable success (67.4%) with zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment surfaces 40% of regressions invisible to sandbox alone, and rollback succeeds in 79.8% of post-activation drift scenarios. The work extends runtime governance from action execution to capability evolution.
comment: 42 pages, 5 figures, 10 tables, 7 appendices
♻ ☆ Beyond Retrieval: A Multitask Benchmark and Model for Code Search
Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce \textsc{CoREB}, a contamination-limited, multitask \underline{co}de \underline{r}etrieval and r\underline{e}ranking \underline{b}enchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. \textsc{CoREB} is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: \circone code-specialised embeddings dominate code-to-code retrieval (${\sim}2{\times}$ over general encoders), yet no single model wins all three tasks; \circtwo short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; \circthree off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; \circfour our fine-tuned \textsc{CoREB-Reranker} is the first to achieve consistent gains across all three tasks. The data and model are released.
comment: project site: https://hq-bench.github.io/coreb-page/
♻ ☆ LoopNav: Benchmarking Spatial Consistency in World Models
The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. It must not only retain long-horizon observational information, but also enables the construction of explicit or implicit internal spatial representations. However, existing datasets do not explicitly enforce spatial consistency constraints, limiting both the ability to systematically evaluate this capability and to learn it through data-driven approaches. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we propose LoopNav, a dataset and corresponding benchmark centered on loop-based navigation for evaluating spatial consistency. The dataset comprises 250 hours (20 million frames) of loop-based navigation videos with actions, collected from diverse locations in the open-world environment of Minecraft. We further introduce a Scene Graph Consistency Score to quantify spatial consistency while remaining invariant to pixel-level variations. Dataset, benchmark, and code are open-sourced to support future research.
comment: V3: SGCS
♻ ☆ Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala SC
The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.
comment: Published at SCSE 2026 (9th IEEE International Research Conference on Smart Computing and Systems Engineering). Best Paper Award - Text Analytics Track
♻ ☆ OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.
♻ ☆ LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to $\sim$3 hours ($\sim$20x) for the competitive analysis.
♻ ☆ Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
Self-supervised pretraining is promising for large-scale neuroimaging, yet the impact of region-aware masking and hybrid sequence modeling remains underexplored. In this work, we introduce Rhamba, a region-aware pretraining framework that integrates anatomically guided masking with hybrid Attention-Mamba architectures for resting state functional magnetic resonance imaging (fMRI) analysis. Models were pretrained on the ABIDE dataset using region-aligned patch embeddings and three masking strategies (Any, Majority, and Pure) with increasing spatial specificity. We evaluated four architectural variants: a Mamba only model, an Alternate architecture with interleaved Mamba and Attention blocks, and two hybrid encoder-decoder configurations (Attention-Mamba (AM) and Mamba-Attention (MA)). The pretrained models were fine-tuned on downstream classification tasks using the COBRE and ADHD-200 datasets for schizophrenia and attention-deficit/hyperactivity disorder discrimination. We employed Integrated Gradients, an explainable AI method, to identify the brain regions contributing to model predictions. Masking strategy strongly influenced reconstruction behavior, with reconstruction loss following a consistent ordering (Any > Majority > Pure). However, this trend did not directly translate into downstream performance, where differences were modest and dataset-dependent. The hybrid architecture with the MA configuration achieved the highest average AUROC across both datasets, and Rhamba outperformed state-of-the-art methods in comparative evaluation. Region-wise analysis showed that peak performance depends on the interaction between masking strategy and architecture rather than a single dominant configuration. Overall, Rhamba offers a flexible framework for balancing interpretability, scalability, and performance in large-scale fMRI representation learning.
♻ ☆ CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion, which focuses on adapting models to specific cultural contexts while preserving their original behavior in other cultures. To facilitate research in this area, we introduce CrossCult-KIBench, a comprehensive evaluation benchmark for assessing both the effectiveness of knowledge insertion and its unintended side effects on non-target cultures. The benchmark includes 9,800 image-grounded cases covering 49 culturally relevant visual scenarios across English, Chinese, and Arabic language-culture groups. It supports evaluation in both single-insert and sequential-insert settings. We also propose Memory-Conditioned Knowledge Insertion (MCKI) as a baseline method. MCKI retrieves relevant cultural knowledge from an external memory using frozen MLLM representations, prepending matched entries as conditional prompts when applicable. Extensive experiments on CrossCult-KIBench reveal that current approaches struggle to balance effective cultural adaptation with behavioral preservation, highlighting a key challenge in developing culturally-aware MLLMs. Our work thus underscores an important research direction for developing more culturally adaptive and responsible MLLMs.
♻ ☆ ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms
Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural `actions' ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative ``human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shift focuses from implementation mechanics to methodological innovation, accelerating scientific discovery.
♻ ☆ Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMA 3.2 to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in major neutrino experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events, and also a Vision Transformer (ViT-h/14), which is the same architecture inside the VLM's vision encoder. Our evaluation considers both classification performance and interpretability of the model predictions, comparing a VLM with a vision-only transformer (ViT) and a convolutional neural network (CNN) baseline. We find that transformer-based architectures outperform conventional CNNs in classification accuracy and robustness, with the VLM providing additional flexibility through the integration of auxiliary textual or semantic information and enabling more interpretable, reasoning-based predictions. These results highlight the potential of large transformer models, particularly vision-language models, as general-purpose backbones for physics event classification, combining strong performance, robustness, and interpretability, and opening new avenues for multimodal reasoning in experimental neutrino physics.
comment: Accepted for publication in Communications Physics (Nature Portfolio)
♻ ☆ SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis
Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.
♻ ☆ Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
♻ ☆ NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
comment: 10 pages, 7 figures
Computation and Language 168
☆ LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width--depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data, and code will be open-source at https://github.com/zhengkid/AutoTTS.
comment: 25 pages
☆ Conformal Path Reasoning: Trustworthy Knowledge Graph Question Answering via Path-Level Calibration
Knowledge Graph Question Answering (KGQA) has shown promise for grounded and interpretable reasoning, yet existing approaches often fail to provide reliable coverage guarantees over retrieved answers. While Conformal Prediction (CP) offers a principled framework for producing prediction sets with statistical guarantees, prior methods suffer from critical limitations in both calibration validity and score discriminability, resulting in violated coverage guarantees and excessively large prediction sets. To address these pitfalls, we propose Conformal Path Reasoning (CPR), a trustworthy KGQA framework with two key innovations. First, we perform query-level conformal calibration over path-level scores, preserving the exchangeability while generating path prediction sets. Second, we introduce the Residual Conformal Value Network (RCVNet), a lightweight module trained via PUCT-guided exploration to learn discriminative path-level nonconformity scores. Experiments on benchmarks show that CPR significantly improves the Empirical Coverage Rate by 34% while reducing average prediction set size by 40% compared to conformal baselines. These results validate the efficacy of CPR in satisfying coverage guarantees with substantially more compact answer sets.
comment: 13 pages, 3 figures, 2 tables;
☆ The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model--game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.
☆ CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.
☆ Accurate and Efficient Statistical Testing for Word Semantic Breadth ACL 2026
Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
comment: Accepted to ACL 2026 Main Conference
☆ Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs
Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.
comment: Accepted to ISBI 2026
☆ Fast Byte Latent Transformer
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
☆ Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims NeurIPS 2026
Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
comment: 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)
☆ Tool Calling is Linearly Readable and Steerable in Language Models
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
comment: 29 pages, 6 figures, 7 tables. Manuscript under review
☆ GLiGuard: Schema-Conditioned Classification for LLM Safeguard
Ensuring safe, policy-compliant outputs from large language models requires real-time content moderation that can scale across multiple safety dimensions. However, state-of-the-art guardrail models rely on autoregressive decoders with 7B--27B parameters, reformulating what is fundamentally a classification problem as sequential text generation, a design choice that incurs high latency and scales poorly to multi-aspect evaluation. In this work, we introduce \textbf{GLiGuard}, a 0.3B-parameter schema-conditioned bidirectional encoder adapted from GLiNER2 for LLM content moderation. The key idea is to encode task definitions and label semantics directly into the input sequence as structured token schemas, enabling simultaneous evaluation of prompt safety, response safety, refusal detection, 14 fine-grained harm categories, and 11 jailbreak strategies in a single non-autoregressive forward pass. This schema-conditioned design lets supported task and label blocks be composed directly in the input schema at inference time. Across nine established safety benchmarks, GLiGuard achieves F1 scores competitive with 7B--27B decoder-based guards despite being 23--90$\times$ smaller, while delivering up to 16$\times$ higher throughput and 17$\times$ lower latency. These results suggest that compact bidirectional encoders can approach the accuracy of much larger guard models while drastically reducing inference cost. Code and models are available at https://github.com/fastino-ai/GLiGuard.
comment: 20 pages, 4 figures
☆ Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Long-horizon AI agents execute complex workflows spanning hundreds of sequential actions, yet a single wrong assumption early on can cascade into irreversible errors. When instructions are incomplete, the agent must decide not only whether to ask for clarification but when, and no prior work measures how clarification value changes over the course of execution. We introduce a forced-injection framework that provides ground-truth clarifications at controlled points in the agent's trajectory across four information dimensions (goal, input, constraint, context), three agent benchmarks, and four frontier models (three per benchmark; one on a single benchmark only; 84 task variants; 6,000+ runs). Counter to the common intuition that "earlier is always better," we find that the value of clarification depends sharply on what information is missing: goal clarification loses nearly all value after 10% of execution (pass@3 drops from 0.78 to baseline), while input clarification retains value through roughly 50%. Deferring any clarification type past mid-trajectory degrades performance below never asking at all. Cross-model Kendall tau correlations (0.78-0.87 among models sharing identical task coverage; 0.34-0.67 across the full 4-model panel) confirm these timing profiles are substantially task-intrinsic. A complementary study of 300 unscripted sessions reveals that no current frontier model asks within the empirically optimal window, with strategies ranging from over-asking (52% of sessions) to never asking at all. These empirical demand curves provide the quantitative foundation that existing theoretical frameworks require but have lacked, and establish concrete design targets for timing-aware clarification policies. Code and data will be publicly released.
☆ How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent diffusion modeling is constructing a suitable latent space. In this work, we present the Latent Diffusion Language Model (LDLM), in which the latent encoder, diffusion model, and decoder are trained jointly. LDLM builds its latent space by reshaping the representations of a pre-trained language model with a trainable encoder, yielding latents that are easy to both denoise and decode into tokens. We show that naive joint training produces a low-quality diffusion model, and propose a simple training recipe consisting of an MSE decoder loss, diffusion-to-encoder warmup, adaptive timestep sampling, and decoder-input noise. Ablations show that each component substantially impacts generation performance. On OpenWebText and LM1B, LDLM achieves better generation performance than existing discrete and continuous diffusion language models while being $2{\text -}13\times$ faster, indicating that jointly learning the latent space is a key step toward making latent diffusion competitive for text generation.
☆ How Value Induction Reshapes LLM Behaviour ACL 2026
Conversational Large Language Models are post-trained on language that expresses specific behavioural traits, such as curiosity, open-mindedness, and empathy, and values, such as helpfulness, harmlessness, and honesty. This is done to increase utility, ensure safety, and improve the experience of the people interacting with the model. However, values are complex and inter-related -- inducing one could modify behaviour on another. Further, inducing certain values can make models more addictive or sycophantic through language used in the generations, with a potential detrimental effect on the user. We investigate these and other unintended effects of value induction into models. We fine-tune models using curated value subsets of existing preference datasets, measuring the impact of value induction on expression of other values, model safety, anthropomorphic language, and various QA benchmarks. We find that (i) inducing values leads to expression of other related, and sometimes contrastive values, (ii) inducing positive values increases safety, and (iii) all values increase anthropomorphic language use, making models more validating and sycophantic.
comment: Accepted to Findings of ACL 2026
☆ Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
☆ CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers ICML 2026
Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.
comment: Accepted by ICML 2026
☆ Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement
Large Language Models (LLMs) rely on safety alignment to obey safe requests while refusing harmful ones. However, traditional refusal mechanisms often lead to "rigid rejection," where a general template (e.g., "I cannot fulfill this request") indiscriminately triggers refusals and severely undermines the naturalness of interactions between humans and LLMs. To address this issue, LANCE is proposed in this paper to ensure safe yet flexible and natural responses via label enhancement. Specifically, LANCE employs variational inference to perform label enhancement, predicting a continuous distribution across multiple rejection categories. These fine-grained rejection distributions provide multi-way textual gradients for a refinement model to neutralize the hazardous elements in the prompt, so that the LLMs could generate safe responses that avoid rigid rejections while preserving the naturalness of interactions. Experiments demonstrate that LANCE significantly alleviates the rigid rejection problem while maintaining high security standards, significantly outperforming existing baseline models in terms of helpfulness and naturalness of responses.
☆ KL for a KL: On-Policy Distillation with Control Variate Baseline
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline-canonically a value function -- from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
☆ MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
☆ Measuring and Mitigating the Distributional Gap Between Real and Simulated User Behaviors
As user simulators are increasingly used for interactive training and evaluation of AI assistants, it is essential that they represent the diverse behaviors of real users. While existing works train user simulators to generate human-like responses, whether they capture the broad and heterogeneous distribution of real user behaviors remains an open question. In this work, we introduce a method to measure the distributional gap between real and simulated user behaviors, validated through a human study and ablations. Given a dataset of real and simulated conversations, our method extracts representations of user behavior from each conversation, quantizes them into discrete distributions via clustering, then computes divergence metrics. We provide the first systematic evaluation of 24 LLM-based user simulators on coding and writing tasks, and reveal a large distributional gap from real users that varies across model families, scales, and behavioral facets. Pairwise comparisons show that most simulators behave similarly, while a few stand apart. Combining behaviorally complementary simulators brings the resulting distribution closer to real users compared to either simulator on its own. Finally, a TF-IDF analysis of the clusters surfaces interpretable patterns of behaviors that simulators capture, miss, and hallucinate.
☆ SCENE: Recognizing Social Norms and Sanctioning in Group Chats
Online group chats are social spaces with implicit behavior patterns that, when broken, are often met with social sanctioning from the group. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce SCENE, a social-interaction benchmark focused on implicit norms and social sanctioning in multi-party chat. SCENE generates plausible non-roleplay scenarios with scripted personas that follow a hidden norm, create opportunities for the subject agent to violate it, and sanction breaches when they occur. We further propose behavioral evaluation metrics for two functional adaptation abilities: responsiveness to negative sanctioning, and adapting norm from peers behavior. We evaluate six frontier and open-weight models on SCENE. Our results show that Claude Opus 4.7 and Gemini 3.1 Pro adapt to implicit norms significantly more than the evaluated open-weight models. SCENE contributes one benchmark in the direction of recent calls for dynamic, interactional evaluation of LLM social capabilities.
☆ GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens ($\texttt{}$), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
☆ OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
☆ A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches for Sentiment Classification on IMDb Movie Reviews
This paper presents a comparative study of classical machine learning and deep learning methods for sentiment classification on the IMDb movie reviews dataset. The machine learning pipeline uses TF-IDF features and PyCaret AutoML to evaluate Logistic Regression, Naïve Bayes, and Support Vector Machine, while the deep learning pipeline implements BiLSTM and BiLSTM with an attention mechanism. Experimental results show that classical machine learning, especially SVM, achieves the best performance with an accuracy of 0.8530, outperforming the deep learning models in this study. The BiLSTM with Attention model improves over the standard BiLSTM and reaches an accuracy of 0.706, indicating better contextual modeling. The paper concludes that although deep learning can capture sequential dependencies, classical machine learning remains a strong baseline when combined with effective feature engineering such as TF-IDF, particularly under limited data and computational resources.
comment: 10 pages, 4 authors from Department of Data Science and 2 authors from Department of Informatics Engineering, Institut Teknologi Sumatera, Indonesia
☆ Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
☆ PolySQL: Scaling Text-to-SQL Evaluation Across SQL Dialects via Automated Backend Isomorphism
SQL dialects vary in syntax, types, and functions across database engines. Text-to-SQL benchmarks, however, predominantly support only SQLite. This creates a critical evaluation gap: cross-dialect evaluation reveals weak per-query agreement (Cohen's ), showing that SQLite performance is an unreliable proxy for other dialects. Yet such evaluation remains prohibitively difficult: existing approaches either require expensive manual query transpilation or rely on tools that often fail on complex SQL. To close this gap, we introduce PolySQL, a novel dual-execution method that eliminates the need for query transpilation by comparing normalized execution results. Notably, our approach achieves higher evaluation fidelity than query transpilation with 100% query coverage. PolySQL comprises three datasets, enabling the first large-scale cross-dialect study. Our study reveals a 10.1% average accuracy drop from SQLite to other dialects and identifies a significant dialect difficulty hierarchy. We find this degradation stems from logical rather than syntactic errors (61% vs. 8%). We release our framework code and leaderboard to enable rigorous dialect-robust evaluation.
☆ Hybrid TF--IDF Logistic Regression and MLP Neural Baseline for Indonesian Three-Class Sentiment Analysis on Social Media Text
This paper presents a compact three-class sentiment analysis study for Indonesian social media text. The task is formulated with positive, negative, and neutral outputs derived from a fine-grained emotion dataset. The proposed practical baseline combines TF--IDF text features, three lightweight numeric metadata features, and a balanced multinomial Logistic Regression classifier. For comparison, the study also includes a neural baseline using a two-layer multilayer perceptron (MLP) over the same hybrid feature representation. The dataset originally contains 732 rows and 191 fine-grained emotion labels; after cleaning, deduplication, and label remapping, 707 samples remain with an imbalanced distribution of 459 positive, 188 negative, and 60 neutral instances. Experimental results show that the Logistic Regression deployment model reaches 0.8028 accuracy, 0.8003 weighted F1, and 0.7276 macro F1, while project documentation reports a higher-accuracy but non-production MLP baseline. These findings indicate that careful preprocessing, interpretable feature engineering, and class balancing remain competitive for small Indonesian sentiment datasets, whereas the neural baseline is better treated as a comparative experiment than as the default deployment model.
comment: 8 pages, 4 figures, 4 tables. Research paper on Indonesian three-class sentiment analysis using TF--IDF, Logistic Regression, and MLP baselines
☆ Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Large language models (LLMs) achieve strong performance but remain costly to deploy in resource-constrained settings. Training small language models (SLMs) from scratch is computationally expensive, while conventional knowledge distillation requires repeated access to large teachers for different target sizes, leading to poor scalability. To solve these problems, we propose \textbf{Chain-based Distillation (CBD)}, a scalable paradigm for efficiently initializing variable-sized language models. A sparse and limited sequence of intermediate models (called anchors) is constructed via stepwise distillation, forming a distillation chain that progressively transfers knowledge from the source LLMs. To support heterogeneous settings, we introduce \emph{bridge distillation} for cross-architecture and cross-vocabulary transfer. Models of variable sizes are initialized via parameter interpolation between adjacent anchors, eliminating repeated large teacher inference. Experiments show that the proposed method substantially improves efficiency and downstream performance. A 138M-parameter SLM without recovery pre-training, outperforms scratch-trained models on a 10B-token corpus on the specific task. CBD also demonstrates versatility in heterogeneous settings for initialize models with different architectures and vocabularies.
☆ CktFormalizer: Autoformalization of Natural Language into Circuit Representations
LLMs can generate hardware descriptions from natural language specifications, but the resulting Verilog often contains width mismatches, combinational loops, and incomplete case logic that pass syntax checks yet fail in synthesis or silicon. We present CktFormalizer, a framework that redirects LLM-driven hardware generation through a dependently-typed HDL embedded in Lean 4. Lean serves three roles: (i) type checker:dependent types encode bit-width constraints, case coverage, and acyclicity, turning hardware defects into compile-time errors that guide iterative repair; (ii) correctness firewall:compiled designs are structurally free of defects that cause silent backend failures (the baseline loses 20% of correct designs during synthesis and routing; CktFormalizer preserves all of them); (iii) proof assistant:the agent constructs machine-checked equivalence proofs over arbitrary input sequences and parameterized widths, beyond the reach of bounded SMT-based checking. On VerilogEval (156 problems), RTLLM (50 problems), and ResBench (56 problems), CktFormalizer achieves simulation pass rates competitive with direct Verilog generation while delivering substantially higher backend realizability: 95--100% of compiled designs complete the full synthesis, place-and-route, DRC, and LVS flow. A closed-loop PPA optimization stage yields up to 35% area reduction and 30% power reduction through validated architecture exploration, with automated theorem proof ensuring that each optimized variant remains functionally equivalent to its formal specification.
☆ Tracing Uncertainty in Language Model "Reasoning"
Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the "reasoning" traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM "reasoning".
☆ Rethinking State Tracking in Recurrent Models Through Error Control Dynamics
The theory of state tracking in recurrent architectures has predominantly focused on expressive capacity: whether a fixed architecture can theoretically realize a set of symbolic transition rules. We argue that equally important is error control, the dynamics governing hidden-state drift along the directions that distinguish symbolic states. We prove that affine recurrent networks, a class of models encompassing State-Space Models and Linear Attention, cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation. We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails. These results establish that robust state tracking is determined not only by an architecture's theoretical expressivity but crucially by its error control.
☆ TextLDM: Language Modeling with Continuous Latent Diffusion
Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding (text generation) is to apply this framework to language modeling. We propose TextLDM, which transfers the visual latent diffusion recipe to text generation with minimal architectural modification. A Transformer-based VAE maps discrete tokens to continuous latents, enhanced by Representation Alignment (REPA) with a frozen pretrained language model to produce representations effective for conditional denoising. A standard DiT then performs flow matching in this latent space, identical in architecture to its visual counterpart. The central challenge we address is obtaining high-quality continuous text representations: we find that reconstruction fidelity alone is insufficient, and that aligning latent features with a pretrained language model via REPA is critical for downstream generation quality. Trained from scratch on OpenWebText2, TextLDM substantially outperforms prior diffusion language models and matches GPT-2 under the same settings. Our results establish that the visual DiT recipe transfers effectively to language, taking a concrete step toward unified diffusion architectures for multimodal generation and understanding.
☆ Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.
☆ SOD: Step-wise On-policy Distillation for Small Language Model Agents
Tool-integrated reasoning (TIR) is difficult to scale to small language models due to instability in long-horizon tool interactions and limited model capacity. While reinforcement learning methods like group relative policy optimization provide only sparse outcome-level rewards. Recently, on-policy distillation (OPD) has gained popularity by supplying dense token-level supervision from a teacher on student-generated trajectories. However, our experiments indicate that applying OPD to TIR leads to a critical failure mode: erroneous tool calls tend to cascade across subsequent reasoning steps, progressively amplifying student-teacher divergence and rendering the teacher's token-level supervision increasingly unreliable. To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence. Therefore, SOD can attenuate potentially misleading teacher signals in high-divergence regions while preserving dense guidance in well-aligned states. Experiments on challenging math, science, and code benchmarks show that SOD achieves up to 20.86% improvement over the second-best baseline. Notably, our 0.6B student achieves 26.13% on AIME 2025, demonstrating effective transfer of agentic reasoning to lightweight models. Our code is available at https://github.com/YoungZ365/SOD.
☆ Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
comment: 22 pages, 5 figures, 11 tables
☆ SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose \textbf{\underline{Sim}ple \underline{C}ross-\underline{T}okenizer OPD (SimCT)}, which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at \href{https://github.com/sunjie279/SimCT-}{https://github.com/sunjie279/SimCT-}.
comment: 4 figures, 6 tables, 28 pages
☆ Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models ICLR2026
Classifier-Free Guidance (CFG) is a widely used mechanism for controlling diffusion-based generative models, yet its guidance scale is typically treated as a fixed hyperparameter throughout generation. This static design yields a suboptimal controllability and quality tradeoff, as the optimal degree of guidance varies across tasks and across different stages of the diffusion process, especially in NLP domain. We recast CFG scale selection as a sequential decision-making problem and propose to learn dynamic guidance trajectories via reinforcement learning. Specifically, we model the guidance scale as a discrete control action selected at each generation step based on the evolving diffusion state, and optimize a policy using Proximal Policy Optimization (PPO) under task-level rewards. Experiments on three controlled NLP generation tasks using discrete diffusion language models demonstrate that adaptive guidance consistently achieves a better balance between controllability and generation quality than fixed-scale strategies. Further analysis of the learned policies reveals distinct and interpretable guidance trajectories across tasks, underscoring the importance of treating guidance as a dynamic control process rather than a static design choice.
comment: ReALM-GEN@ICLR2026
☆ DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
LLM-based agents are increasingly deployed for routine but consequential tasks in real-world domains, where their behavior is governed by inherently ambiguous domain policies that admit multiple valid interpretations. Despite the prevalence of such ambiguities in practice, existing agent benchmarks largely assume unambiguous, well-specified policies, leaving a critical evaluation gap. We introduce DRIP-R, a benchmark that systematically exploits real-world retail policy ambiguities to construct scenarios in which no single correct resolution exists. DRIP-R comprises a curated set of policy-ambiguous return scenarios paired with a realistic customer personas, a full-duplex conversational simulation with tool-calling capabilities and a multi-judge evaluation framework covering policy adherence, dialogue quality, behavioral alignment, and resolution quality. Our experiments show that frontier models fundamentally disagree on identical policy-ambiguous scenarios, confirming that ambiguity poses a genuine and systematic challenge to LLM decision-making.
comment: 10 pages
☆ TRACE: Tourism Recommendation with Accountable Citation Evidence
Tourism is a high-stakes setting for conversational recommender systems (CRS): a plausible-sounding suggestion can waste real money and trip time once a traveler acts on it. Existing CRS benchmarks primarily evaluate systems with a single Recall@k score over entity mentions, and tourism-specific resources add spatial or knowledge-graph context, yet none of them couple multi-turn recommendation with verbatim review-span evidence and rejection recovery. This leaves an evaluation gap for tourism recommendation that is simultaneously trustworthy, verifiable, and adaptive: recommend the right point of interest (POI) for multi-aspect preferences (such as cuisine, price, atmosphere, walking distance), justify each suggestion with verifiable evidence from prior visitors so the traveler can act without trial and error, and recover when the first recommendation is rejected mid-dialogue. We introduce TRACE, where each item is a multi-turn tourism recommendation dialogue with review-span citations and explicit rejection turns: 10,000 dialogues over 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities, paired with 14 retrieval, planning, and LLM baselines, along with 25 metrics organized under Accuracy, Grounding, and Recovery. Across these baselines, TRACE reveals the Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; Multi-Review Synthesis fails at recovery. The Grounding Score agrees with human citation precision (Spearman rho=+0.80, p<10^-20), and paired t-tests reproduce the per-baseline ranking (p<0.01 on the dominant contrasts). TRACE reframes accountable tourism recommendation as a joint target (right POI, verifiable evidence, adaptive repair) rather than a single-axis leaderboard.
☆ Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning
Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how concentrated or diffuse the contextual support is for each response token. We first show that token-level RL objectives are sparsely estimable: uniformly random 20 percent token subsets preserve much of the full-token held-out performance, suggesting substantial redundancy in token-level updates. However, entropy-structured subsets behave very differently. Low-attention-entropy tokens, which we call anchors, rely on concentrated support, produce stable gradients aligned with full-token updates, and provide a reliable optimization backbone, but tend to plateau on harder benchmarks. High-attention-entropy tokens, which we call explorers, aggregate more diffuse context and induce larger but more volatile gradients. Explorer-only training is unstable on average, though rare successful runs suggest that these tokens may contain useful hard-reasoning signals when optimization remains stable. We support this anchor-explorer spectrum with evidence-gathering analyses, entropy dynamics, gradient-geometry diagnostics, and controls showing that position, predictive entropy, and loss normalization do not explain the observed asymmetry. Finally, a dynamic entropy-aware soft-reweighting intervention improves Qwen3-8B-Base from 34.39 to 37.40 held-out average in the strongest setting. These findings suggest that attention entropy reveals optimization-relevant structure in token-level RL signals, and that uniform token averaging can obscure meaningful heterogeneity in reasoning post-training.
☆ Reliable Chain-of-Thought via Prefix Consistency
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
comment: See our project page at https://naoto-iwase.github.io/prefix-consistency-page
☆ Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
☆ MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing
While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
comment: 24 pages, 2 figures
☆ Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction
Automated assistants for Grammatical Error Correction are now embedded in educational platforms serving millions of learners, yet three critical gaps remain in this domain: (1) latest-generation Large Language Models (LLMs) lack comprehensive evaluation on grammar correction tasks; (2) whether combining these LLMs improves correction quality is unexplored; and (3) the extent to which reference-based metrics underestimate GEC system performance has not been adequately quantified. In this study, first, we evaluate latest-generation LLMs on edit precision, fluency preservation, and meaning retention, showing fine-tuned GPT-4o achieves state-of-the-art performance across all three dimensions. Second, through grammatical error type analysis we demonstrate that individual LLMs exhibit highly similar error correction patterns ($ρ=0.947$). Third, we show that reference-based metrics underestimate GEC performance with 73.76% of GPT-4o corrections different from gold standards being equally valid or even superior. These GEC evaluation findings equip educators with guidance for selecting GEC assistants that enhance rather than constrain student linguistic development. We make our data, code, and models publicly available.
comment: 9 Pages
☆ Post-training makes large language models less human-like
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
☆ Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
comment: work in progress
☆ Is She Even Relevant? When BERT Ignores Explicit Gender Cues
Gender bias in large language models has primarily been investigated for English, while languages with grammatical or morphological gender remain comparatively understudied. This paper investigates how and when gender information emerges in a Dutch BERT model trained from scratch, offering one of the first checkpoint-level analyses of bias formation in a Transformer architecture for a language combining overt morphological gender marking and generic forms. By extracting contextual embeddings throughout training, we construct dynamic gender subspaces using linear SVMs to trace when gender becomes linearly encoded and how this encoding evolves over time. Contextual embeddings are often assumed to integrate contextual cues robustly, allowing models to adjust the representation of a word depending on its more local usage. We therefore test whether explicit gender cues in controlled sentence templates (e.g., Zij is een loodgieter ('She is a plumber')) can override learned statistical associations (plumber -> male). Our findings challenge this assumption: although gender becomes clearly linearly separable around epoch 20 and is distributed across multiple embedding dimensions, the model struggles to update its internal gender representation in light of explicit contextual cues in short sentence templates. Stereotypical gender-profession pairings are predicted far more accurately than anti-stereotypical ones, and generic forms in Dutch systematically default to a male interpretation, even when the context explicitly denotes a female referent. Together, our results seem to indicate that contextualization in the representations learned by our Dutch BERT model is not sufficiently dynamic along the probed gender direction: explicit gender cues in anti-stereotypical contexts are not reliably reflected in the resulting representations, resulting in persistent male-default behaviour.
☆ Intent-Driven Semantic ID Generation for Grounded Conversational News Recommendation ACL 2026
Conversational news recommendation requires grounding each suggestion in a rapidly evolving article corpus while addressing implicit user intents that lack explicit retrievable keywords. To characterize this scenario, we identify 6 intent types from production dialogues: five are implicit and pose fundamental challenges to standard RAG pipelines, forming a critical retrieve-first bottleneck. To address these issues, we introduce intent-driven Semantic ID (SID) generation under a Generate-then-Match paradigm. With two-stage training that consists of multi-task SID alignment and GPT-4 Chain-of-Thought distillation, an LLM maps diverse intents to hierarchical SID prefixes, which are then fuzzy-matched to the current news pool to guarantee fully grounded recommendations. Profile-Aware Dual-Signal Reasoning (PADR) further enables cold-start users to obtain valid recommendations using only profiles. On a mainstream Chinese news platform, our 7B model achieves 0% hallucination and 12.4% L1 match in the 152K open-generation SID space (4x random baseline). It matches GPT-4+Hybrid RAG on L1 while surpassing it on finer-grained metrics (L2 2x, Category +1.2pp) at ~100x lower cost. Cold-start users, where existing baselines score 0%, achieve 18.0% L1 (6x random), the highest among all user groups.
comment: Accepted at ACL 2026 Industry Track (Oral)
☆ Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification ACL 2026
Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026 the eight positive defence categories share surface language and differ only in pragmatic function and trained raters reach only moderate inter-annotator agreement. On such a task the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative) and base model. The system reaches $F1_{test}{=}.420$ on the hidden test set, placing first among 21 registered teams.
comment: Accepted at the BioNLP 2026 PsyDefDetect Shared Task @ ACL 2026 (1st place, 21 registered teams)
☆ Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to ``mastered'' and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept -- distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired $t$-test, $p < 10^{-6}$, Cohen's $d = 0.86$), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1$\times$ higher ATE than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7\% on the contamination-free Omni-MATH-Rule benchmark and 64.0\% overall, compared to 60.5\% for o1-mini, and 97.2\% on GSM8K, 46--50\% on AIME 2024--2026, and 46.2\% on MathArena. The Causal Knowledge Activation component contributes 33.8\% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
comment: 17 pages, 0 figures
☆ Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
☆ Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
The Arrow-of-Time (AoT) task, determining whether a video plays forward or backward by recognizing temporal irreversibility, is one humans solve with near-perfect accuracy, yet frontier Video Large Language Models (Video-LLMs) perform only modestly above chance. This gap raises a key question: do visual backbones fail to encode temporal information, or does information bottleneck lie elsewhere in the Video-LLM architecture? We address this question by isolating the vision encoder from the Video-LLM and tracing temporal information across the encoder, projector, and LLM. We find that video-centric encoders with explicit temporal modeling encode strong temporal signals, whereas frame-centric encoders do not. However, when video-centric representations are passed through a standard Video-LLM architecture, performance often collapses, revealing a bottleneck of temporal information flow. We identify projector design as a key factor: Q-Former disrupts temporal information, while a time-preserved MLP projection substantially improves the LLM's access to such information. Our layer-wise analysis further shows temporal representation dynamics across encoder layers. Guided by these findings, we build a Video-LLM with temporal-aware video-centric encoder, time-preserved projector, and AoT supervision, surpassing human performance on AoT$_{PPB}$ with 98.1\% accuracy, and improving broader temporal reasoning tasks by up to 6.0 points on VITATECS-Direction and 1.3 points on TVBench. Our results show that temporal reasoning in Video-LLMs requires both effective temporal encoding and reliable transfer of this information to the LLM.
☆ Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation
Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
comment: Accepted to the 26th Annual Conference of the European Association for Machine Translation (EAMT2026)
☆ WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation ICML 2026
Accurate weather forecast reporting enables individuals and communities to better plan daily activities and agricultural operations. However, the current reporting process primarily relies on manual analysis of multi-source data, which leads to information overload and reduced efficiency. With the development of multimodal large language models (MLLMs), leveraging data-driven models to analyze and generate reports in the weather forecasting domain remains largely underexplored. In this work, we propose the Weather Forecasting Report (WFR) task and construct the first instruction-tuning dataset for this task, named~\DatasetNameL, which covers 31 cities in America and 8 weather aspects. Based on this corpus, we develop the first model, \ModelNameL, specialized in generating weather forecast reports. Evaluation across multiple metrics on our dataset shows that \ModelNameL~ consistently outperforms leading closed-source MLLMs, particularly on structurally complex weather aspects. We further analyze its performance across diverse geographic regions and weather aspects. \ModelNameL~ demonstrates strong transferability across different regions, highlighting its zero-shot generalization capability. \ModelNameL~offers valuable insight for developing MLLMs specialized in weather report generation. .
comment: ICML 2026
☆ InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce \textbf{InterLV-Search}, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
☆ TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature
The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94\% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.
☆ ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose \textbf{ExpThink}\xspace, an RL framework that addresses both dimensions through two complementary mechanisms. First, \emph{experience-guided reward shaping} tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, \emph{difficulty-adaptive advantage} replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that \textbf{ExpThink}\xspace reduces average response length by up to 77\% while simultaneously improving accuracy, achieving up to $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
comment: 39 pages, 18 figures. Code and model checkpoints will be released upon publication
☆ SEIF: Self-Evolving Reinforcement Learning for Instruction Following
Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
☆ The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
Moltbook is a Reddit-like platform where OpenClaw agents post, comment, and vote at scale - a so far unprecedented incident that comes with serious safety concerns. With the aim of studying emergent behavior in populations, we release the Moltbook Files, a dataset of 232k posts and 2.2M comments covering the platform's first 12 days, processed through a pipeline to identify and remove Personally-Identifiable Information (PII). We analyze community structure, authorship, lexical properties, sentiment, topics, semantic geometry, and comment interaction. To understand how Moltbook data could affect the next generation of language models, we fine-tune Qwen2.5-14B-Instruct on Moltbook Files with three adaptation levels. Our PII pipeline reveals that agents post API keys, passwords, BIP39 seed phrases on Moltbook, a publicly indexed platform. The overall sentiment is mostly neutral and mildly positive (66.6% neutral, 19.5% positive) and shows a tendency for self-referential linking. We find that fine-tuning on Moltbook data reduces truthfulness from 0.366 to 0.187. However, a model fine-tuned on a size-matched Reddit dataset produces a comparable decrease. Moltbook thus seems to be more of a harmless slopocalypse. However, tail risks remain, including agent affordances, contamination of future crawls through self-links, and potential transfer of traits to the next generation of language models. More broadly, our findings highlight the importance of control baselines in emergent misalignment evaluations.
☆ Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as external evaluator disjointed from the policy's primary reasoning trace. Such design confines rubrics to post-hoc measurement, leaving them unable to actively guide the model's generation process. In this work, we introduce Think-with-Rubrics, a novel paradigm for instruction following tasks. Think-with-Rubrics integrates rubric generation into the reasoning context, transforming the rubric from an independent artifact into an internal guidance of LLM's generation. During training, LLM sequentially generates a rubric followed by a response, while a trained rubric verifier provides joint supervision by evaluating the consistency between the answer and the self-generated / golden rubrics. Experiments across multiple benchmarks demonstrate that Think-with-Rubrics consistently outperforms the Rubric-as-Reward baseline supervised by golden rubrics by an average of 3.87 points. We have also discussed the mechanism by which Think-with-Rubrics enhances model performance. Experimental results demonstrate that supervision from golden rubrics and self-generated rubrics enhances the performance of Think-with-Rubrics by improving the quality of self-generated rubrics and increasing the internal consistency of responses respectively.
☆ GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
In-context learning enables large language models to adapt to new tasks, but their performance is highly sensitive to the selected examples. Finding effective demonstrations is particularly difficult in domain-specific, low-data settings where high-quality examples are scarce. We propose GRaSp, a three-stage framework for automatic in-context example optimization. By first generating a large synthetic candidate pool, then structuring it with clustering and dimensionality reduction, and finally using genetic algorithms to find the optimal in-context examples, the framework shows consistent improvements on the NER task. We also introduce a custom diversity-adaptive mutation mechanism, allowing it to transition from the initial broad inter-cluster exploration to focused intra-cluster refinement as the population converges. We evaluate GRaSp on financial named entity recognition (FiNER-139), comparing synthetic and human-annotated candidate pools across pool sizes of 500 and 5000. With non-synthetic data, GRaSp achieves 45.84% micro-F1, consistently outperforming both zero-shot and random few-shot baselines. Synthetic data matches the random baseline but does not exceed it, suggesting that distributional variety in the candidate pool is critical for generalization.
comment: 12 pages, 5 figures
☆ Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora -- making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 2\% of test targets appear identically in training (16/50; 50\% under 8-gram overlap at 70\% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9--39.2 BLEU / 0.622--0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents -- target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9--39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
comment: Accepted to NLP4DH 2026 Conference
☆ Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
☆ SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysis
We report the construction of a Korean evaluation-annotated corpus, hereafter called 'Evaluation Annotated Dataset (EVAD)', and its use in Aspect-Based Sentiment Analysis (ABSA) extended in order to cover e-commerce reviews containing sentiment and non-sentiment linguistic patterns. The annotation process uses Semi-Automatic Symbolic Propagation (SSP). We built extensive linguistic resources formalized as a Finite-State Transducer (FST) to annotate corpora with detailed ABSA components in the fashion e-commerce domain. The ABSA approach is extended, in order to analyze user opinions more accurately and extract more detailed features of targets, by including aspect values in addition to topics and aspects, and by classifying aspectvalue pairs depending whether values are unary, binary, or multiple. For evaluation, the KoBERT and KcBERT models are trained on the annotated dataset, showing robust performances of F1 0.88 and F1 0.90, respectively, on recognition of aspect-value pairs.
☆ Generating training datasets for legal chatbots in Korean
Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
☆ ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements (2) they mostly assume a single and two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
☆ The Proxy Presumption: From Semantic Embeddings to Valid Social Measures ACL 2026
Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the ''Proxy Presumption,'' or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite -- including tests for discriminant, incremental, and predictive validity -- this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.
comment: ACL 2026
☆ Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling", queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.
comment: 12 pages, 14 tables
☆ Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study
Adaptive rank allocation for LoRA, allocating more parameters to important layers and fewer to unimportant ones, consistently improves efficiency under supervised fine-tuning (SFT). We investigate whether this success transfers to reinforcement learning, specifically Group Relative Policy Optimization (GRPO). Using gradient-magnitude profiling on Qwen 2.5 1.5B with GSM8K, we find that it does not: proportional rank allocation degrades accuracy by 4.5 points compared to uniform allocation (70.0% vs. 74.5%), despite using identical parameter budgets. We identify two mechanisms behind this failure. First, the gradient landscape under GRPO is fundamentally flatter than under SFT, the max-to-min layer importance ratio is only 2.17x, compared to >10x reported in SFT literature. All layers carry meaningful gradient signal; none are truly idle. Second, we discover a gradient amplification effect: non-uniform allocation widens the importance spread from 2.17x to 3.00x, creating a positive feedback loop where high-rank layers absorb more gradient while low-rank layers are progressively silenced. Our results suggest that gradient importance does not predict capacity requirements under RL, and that naive transfer of SFT-era rank allocation to alignment training should be avoided.
comment: 4 pages + references
☆ Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative ICML 2026
Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($β_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
comment: 9 pages, 6 figures. Submitted to the Mechanistic Interpretability Workshop at ICML 2026
☆ Activation Differences Reveal Backdoors: A Comparison of SAE Architectures IJCNN 2026
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
comment: Accepted at IJCNN 2026 (IEEE WCCI). ©2026 IEEE
☆ LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification
Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.
☆ Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, and noise injection--applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No--line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No--masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes--the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%->83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
☆ MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.
☆ On the Complexity of the Matching Problem of Regular Expressions with Backreferences
ReDoS is a well-known type of algorithmic complexity attack, where an adversary supplies maliciously crafted strings to a regular expression matching engine, aiming to exhaust computational resources of systems. Even quadratic-time behavior in matching engines has been exploited in successful attacks, as exemplified by major outages at Stack Overflow (2016) and Cloudflare (2019). These incidents motivate a fundamental question: Is it possible to construct matching engines that are provably efficient, running in (near-)linear time in the length of the input string? For classical regular expressions (REGEX), Thompson's construction yields a linear-time algorithm. However, practical engines support powerful features such as backreferences, which strictly extend the expressive power of REGEX but unfortunately increase the risk of ReDoS attacks. This paper investigates the fine-grained complexity of the string matching problem for regular expressions with backreferences (REWBs). Specifically, we consider $r$-use $k$-REWBs. On the hardness side, we show that the string matching problem for $k$-REWBs cannot be solved in $O(n^{2k-ε})$ time for any $ε> 0$ under SETH. We also prove that this problem is \textbf{W[2]}-hard when parameterized by the length of the REWB expression, strengthening the previous \textbf{W[1]}-hardness. Moreover, we prove that this problem for $2$-use $2$-REWBs cannot be solved in $n^{1+o(1)}$ time unless the triangle detection problem can be solved in that time. On the algorithmic side, we present an $O(n \log^2 n)$-time algorithm for $1$-use REWBs, which significantly improves upon the recent $O(n^2)$-time algorithm by Nogami and Terauchi (MFCS, 2025). Our algorithm employs several techniques including suffix trees, transition monoids of REGEXes, factorization forest data structures, and periodicity of strings.
comment: Full version of ICALP 2026; The abstract field is slightly shorter than that in the paper due to arXiv's length limit
☆ Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.
☆ MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA(Yi et al., 2023) templates spanning five task families -- email, table, QA, abstract, and code-comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID's multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla
☆ From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from the multi-select failure and early exit bias, which are not shared by human testees. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.
☆ When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
☆ PaT: Planning-after-Trial for Efficient Test-Time Code Generation ACL 2026
Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69\%.
comment: Accepted to ACL 2026 main conference
☆ Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.
comment: 50 pages, 10 figures, 14 tables
☆ SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
☆ MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.
☆ Teaching Language Models to Think in Code
Tool-integrated reasoning (TIR) has emerged as a dominant paradigm for mathematical problem solving in language models, combining natural language (NL) reasoning with code execution. However, this interleaved setup has three key limitations: code often acts as a post-hoc verifier, intermediate NL computations are error-prone, and NL and code play overlapping rather than clearly distinct roles. We propose ThinC (Thinking in Code), a framework in which code itself serves as the reasoner rather than as a tool invoked by NL. A ThinC trajectory begins with a brief NL planning step, after which all reasoning unfolds through code blocks connected only by their execution outputs. We distill 12.2k code-centric trajectories from a teacher model and train ThinC-1.7B and ThinC-4B with supervised fine-tuning followed by reinforcement learning. ThinC-4B consistently outperforms every TIR baseline on five competition-level math benchmarks and even surpasses the much larger Qwen3-235B-A22B-Thinking. Further analysis shows that ThinC reasons through code: 99.2% of its final answers are grounded in interpreter output, and the model recovers reliably from code execution failures without intermediate NL reasoning. Our code and models will be released soon.
comment: Preprint
☆ Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5\% of the KV cache and consistently outperforms prior works across all configurations. Notably, our method achieves up to 2$\times$ accuracy loss reduction under extreme compression scenarios compared to existing state-of-the-art baselines with minimal overhead.
☆ DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
PromptReps showed that an autoregressive language model can be used directly as a retriever by prompting it to generate dense and sparse representations of a query or passage. Extending this to multiple representatives is inefficient for autoregressive models, since tokens must be generated sequentially, and prior multi-token variants did not reliably improve over single-token decoding. We show that the bottleneck is sequential generation, not the multi-token idea itself. DiffRetriever is a representative-token retriever for diffusion language models: it appends K masked positions to the prompt and reads all K in a single bidirectional forward pass. Across in-domain and out-of-domain evaluation, multi-token DiffRetriever substantially improves over single-token on every diffusion backbone we test, while autoregressive multi-token is flat or negative and pays a latency cost that scales with K where diffusion does not. After supervised fine-tuning, DiffRetriever on Dream is the strongest BEIR-7 retriever in our comparison, ahead of PromptReps, the encoder-style DiffEmbed baseline on the same diffusion backbones, and the contrastively fine-tuned single-vector RepLLaMA. A per-query oracle on the frozen base model exceeds contrastive fine-tuning at the same fixed budget, pointing to adaptive budget selection as future work. Code is available at https://github.com/ielab/diffretriever.
☆ Hallucination Detection via Activations of Open-Weight Proxy Analyzers
We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both 3B and 8B. Across all seven, we consistently beat ReDeEP's token-level AUC of 0.73 on RAGTruth by 7.4 to 10.3 percentage points. Qwen2.5-7B reached an F1 of 0.717, just above ReDeEP's 0.713, while Qwen2.5-0.5B hit 0.706. The most striking finding is how tightly all seven models cluster: AUC spans only 2.3 percentage points across an eighteen-fold difference in model size. Even more surprising, our 3B LLaMA outperforms our 8B LLaMA on RAGTruth, showing that bigger is not always better even within the same model family. Both RAGTruth and LLM-AggreFact include outputs from multiple LLM families, so our results are not skewed toward any particular generator.
comment: 12 pages, 4 figures. Code available at https://github.com/hallu-detect/llm_hallucination_detection
☆ PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat ACL 2026
This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5\% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset's annotation patterns and their impact on model generalization, revealing a critical ''validation trap'' phenomenon where high validation performance correlates with poor test transfer.
comment: Accepted to the EEUCA workshop at ACL 2026
☆ The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
Existing Large Language Model (LLM) benchmarks primarily focus on syntactically correct inputs, leaving a significant gap in evaluation on imperfect text. In this work, we study how word-boundary corruption affects how LLMs detect targeted information. By inserting whitespace characters within words to break them into fragments, LLMs' detection accuracy follows a U-shaped curve with the increase in insertion rate. We refer to this curve as the Text Uncanny Valley. To explain such observation, we propose a mode transition hypothesis: LLMs operate in a word-level mode for near-normal text and a character-level mode for heavily fragmented text, with the valley marking the disordered transition where neither mode is effective. Four experiments and one analysis are consistent with this account: in-context learning fails to rescue valley-bottom performance; regularizing the perturbation substantially reduces the U-shape; a math reasoning task replicates the U-shape for Gemini 3.0 Flash but not for stronger models, suggesting the effect is attenuated when tasks rely less on exact lexical alignment; and tokenization entropy peaks before the F1 minimum, consistent with a regime-conflict interpretation. These findings reveal a failure mode invisible to clean-text benchmarks yet directly relevant to any deployment scenario involving noisy or uncurated text inputs.
comment: 18 pages, 9 figures
☆ Learning Agent Routing From Early Experience
LLM agents achieve strong performance on complex reasoning tasks but incur high latency and compute cost. In practice, many queries fall within the capability boundary of cutting-edge LLMs and do not require full agent execution, making effective routing between LLMs and agents a key challenge. We study the problem of routing queries between lightweight LLM inference and full agent execution under realistic cold-start settings. To address this, we propose BoundaryRouter, a training-free routing framework that uses early behavioral experience and rubric-guided reasoning to decide whether to answer a query with direct LLM inference or escalate to an agent. BoundaryRouter builds a compact experience memory by executing both systems on a shared seed set and retrieves similar cases at inference time to guide routing decisions. To evaluate this method, we introduce RouteBench, a benchmark covering in-domain, paraphrased, and out-of-domain route settings. Experiments show that BoundaryRouter reduces inference time by 60.6% compared to the agent while improving performance by 28.6% over direct LLM inference, outperforming prompt-based and retrieval-only routing by an average of 37.9% and 8.2%, respectively.
comment: 17 pages
☆ Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization ACL 2026
Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract "prompt-answer bridges." TTL aligns the model's actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
comment: Accepted to ACL 2026. 15 pages
☆ A Reproducible Multi-Architecture Baseline for Token-Level Chinese Metaphor Identification under the MIPVU Framework
Metaphor is pervasive in everyday language, yet token-level computational identification of metaphor-related words in Chinese under the MIPVU framework remains under-explored relative to English. This paper presents a reproducible multi-architecture baseline for token-level metaphor identification on the PSU Chinese Metaphor Corpus (PSU CMC), the only widely available MIPVU-annotated Chinese corpus. We systematically compare three model families: (i) encoder fine-tuning with Chinese RoBERTa-wwm-ext-large; (ii) MelBERT adapted to Chinese using a newly constructed basic-meaning resource derived from the Modern Chinese Dictionary, 7th edition (MCD7), comprising 74,823 entries with 71.51% PSU CMC vocabulary coverage; and (iii) Qwen3.5-9B fine-tuned with QLoRA as an instruction-tuned generative baseline. Across five fixed seeds, MelBERT MIP-only achieves the strongest performance at 0.7281 +/- 0.0050 test positive F1, marginally above MelBERT Full (0.7270 +/- 0.0069) and clearly above plain RoBERTa (0.7142 +/- 0.0121). The Qwen QLoRA generative configuration trails encoder baselines by approximately 11 F1 points (0.6157 +/- 0.0113). Three findings merit attention: (1) the SPV channel of MelBERT does not contribute reliable positive signal in Chinese, consistent with the dominance of conventional metaphor; (2) the Qwen-encoder gap is concentrated in recall, reflecting the discrete-commitment limitation of generative output; (3) several Qwen task formulations fail due to format design rather than model capacity. We release all split manifests, per-seed outputs, the MCD7 basic-meaning embedding pipeline, and training scripts to serve as a common reference for future Chinese metaphor identification research.
☆ Rethinking Experience Utilization in Self-Evolving Language Model Agents
Self-evolving agents improve by accumulating and reusing experience from past interactions. Existing work has largely focused on how experience is constructed, represented, and updated, while paying less attention to how experience should be used during runtime decision-making. As a result, most agents rely on rigid usage strategies, either injecting experience once at initialization or at every step, without considering whether it is needed for the current decision. This paper studies experience utilization as a critical design dimension of self-evolving agents. We ask whether agents benefit from interweaving experience use with decision-making, so that experience is invoked only when additional guidance is needed. To examine this question, we introduce {ExpWeaver}, a lightweight instantiation that leaves experience construction unchanged and modifies only runtime utilization by exposing experience as an optional resource during reasoning. Across four representative frameworks, seven LLM backbones, and three types of environments, ExpWeaver consistently achieves the best performance among different utilization strategies. Reinforcement learning experiments further show that this behavior can be amplified through training. Usage-pattern, causal ablation, and entropy-based analyses reveal that ExpWeaver enables agents to invoke experience selectively, at beneficial decision points, and under higher reasoning uncertainty. Overall, our findings call for a shift from merely studying \emph{what} experience to store toward understanding \emph{how} and \emph{when} experience should enter decision-making.
comment: 30 pages, 20 figures, 7 tables
☆ CLIPer: Tailoring Diverse User Preference via Classifier-Guided Inference-Time Personalization
Personalized LLMs can significantly enhance user experiences by tailoring responses to preferences such as helpfulness, conciseness, and humor. However, fine-tuning models to address all possible combinations of user preferences is computationally expensive and impractical. In this paper, we introduce \textbf{CLIPer}(\textbf{Cl}assifier-guided \textbf{I}nference-time \textbf{Per}sonalization), a lightweight personalization approach that leverages a classifier model to steer LLM generation dynamically to different user preferences at inference time. Our method eliminates the need for extensive fine-tuning, inducing negligible additional computational overhead while enabling more controllable and nuanced personalization across single and multi-dimensional preferences. Comprehensive empirical analyses demonstrate the scalability and effectiveness of our approach in delivering personalized language generation.
☆ Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings
Vector search and retrieval-augmented generation (RAG) rest on the assumption that cosine similarity between text embeddings reflects conceptual relatedness. We measure where this assumption breaks. We build an augmented citation graph over 3.58M scientific papers and partition it via Leiden CPM at two granularities: sub-field (L1) and research-agenda (L2, hierarchical inside each L1). Four state-of-the-art embeddings (Gemini, Qwen3-8B, Qwen3-0.6B, SPECTER2) clear the L1 bar reasonably (45-52% top-10 same-rate) but stop working at L2: only 15-21% of top-10 neighbors share the query's research agenda. In absolute terms, 8 of every 10 retrieved papers are off-agenda. The failure is universal across eight scientific domains and all four models; SPECTER2, despite its citation-based contrastive training, is the weakest. As a diagnostic probe, we test whether the same augmented graph also functions as a retrieval signal: a deliberately simple citation-count rerank reaches 57.7% top-1 L2 on top of LLM-expanded Boolean retrieval and 59.6% on top of plain BM25, on 80 curated agenda queries -- about 9 points above the best cosine retriever (Gemini, 50.6%) and 20 points above BM25 alone (39.3%). The probe isolates a slice of the agenda-matching signal the graph carries but the embeddings miss, connecting recent theoretical limits on single-vector retrieval to a concrete failure mode of scientific RAG.
comment: 16 pages, 4 figures, 4 tables
☆ Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.
☆ Structural Rationale Distillation via Reasoning Space Compression
When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.
☆ Region4Web: Rethinking Observation Space Granularity for Web Agents
Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.
☆ The Position Curse: LLMs Struggle to Locate the Last Few Items in a List
Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.
☆ Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation
Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL.
☆ Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
Computer-use agents(CUAs)are moving frombounded benchmarks toward real software environments, wherethey operate browsers, desktops, mobile applications, flesystems,terminals, and tool backends. In such settings, reliability isno longer captured by task success alone: perception errors,planning drift, memory use, tool mediation, permission scope,and runtime oversight jointly determine whether agent actionsremain aligned with user intent, Existing surveys organize theCUA landscape by methods, platforms, benchmarks, or securitythreats, but less explicitly connect capability formation, author-ity exposure, failure manifestation, and control placement. Toaddress this gap, the article develops an architecture-lifecycleframework for deployment-grounded reliability in CUAs. Thearchitectural view analyzes Perception, Decision, and Executionas coupled layers that transform software observations intoauthority-bearing actions, The lifecycle view examines Creation.Deployment, Operation, and Maintenance as stages in which priorsare learned, tools and permissions are bound, runtime trajecto.ries are stressed, and assurance must be preserved under drift.Using this lens, the analysis synthesizes representative systems,benchmarks, and security/privacy studies; distinguishes wherefailures become visible from where their enabling conditions areintroduced, and maps recurring intervention surfaces for controloversight, and assurance. OpenClaw is used only as a public moti.vating example of an open deployment pattern, not as a verifedinternal case study. The conclusion highlights open challengesin controllable grounding, long-horizon constraint preservation,safe authority binding, mixed-trust runtime defense, privacy-preserving memory,and continual assurance.
☆ Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.
comment: 19 pages, 8 figures
☆ Theoretical Limits of Language Model Alignment
Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
☆ SAGE: Hierarchical LLM-Based Literary Evaluation through Ontology-Grounded Interpretive Dimensions
Evaluating literary quality requires assessing interpretive dimensions such as cultural representation, emotional depth, and philosophical sophistication that resist straightforward computational measurement. We introduce SAGE, a hierarchical evaluation framework that decomposes literary quality into ontology-grounded interpretive dimensions assessed through structured large language model evaluation with multi-round iterative reflection and independent validation. We validate the framework on 100 short stories (50 canonical works, 30 pulp fiction, 20 LLM-generated narratives) across three analytical layers (cultural, emotional-psychological, existential-philosophical) using dual-mode assessment. Across 600 evaluations, the framework achieves 98.8% score convergence and greater than 94% inter-rater agreement, with near-perfect mode invariance between content-based and metadata-based evaluation. Statistical analysis reveals a consistent genre hierarchy (Canonical > Pulp > LLM, all p<0.001) with layer-specific discrimination: cultural critique and philosophical depth exhibit very large effect sizes (Cohen's d>2.4), while emotional representation shows smaller gaps (d=1.68), suggesting that affective patterns are more learnable from training data than critical stance or philosophical depth. Cross-layer correlations (r=0.649-0.683) confirm the three dimensions capture empirically distinguishable quality facets. These findings demonstrate that theory-driven LLM evaluation can achieve measurement-grade reliability and support systematic identification of where current generative models fall short of human literary production, with direct implications for scalable automated evaluation of open-ended text generation.
comment: 19 pages, 4 figures
☆ The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks NeurIPS 2026
The Translation Tax is often treated as a scalar: translated benchmarks are assumed to inflate scores by preserving English-source cues. We audit this claim in an English-to-Chinese setting. Three proxy estimators disagree: back-translation gaps are small and parser-fragile; cue-score calibration does not predict item-level gains; and a six-model native-control comparison shows model-family rather than uniform benchmark effects. We add a same-item LLM-naturalization stress test that holds answer, options, and content fixed while rewriting Chinese surface form. After correcting a prompt-construction bug, this contrast no longer supports a model-family interaction, but it preserves a residue dose-response: high-residue items benefit while low-residue items do not. The result is not a single Translation Tax, but a set of estimator- and item-dependent validity risks. We release per-cell evidence, the naturalization protocol, human QC, and a reporting checklist for translated multilingual benchmark papers.
comment: 13 pages, 3 figures. Submitted to NeurIPS 2026
☆ Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation
Automatic speech recognition (ASR) evaluation compares system output to ground truth transcripts, with Word Error Rate (WER) quantifying the distance between them. But ground truth transcripts are not discovered - they are produced by human annotators following conventions that encode normative assumptions about which speech features matter. Different conventions (verbatim, non-verbatim, legal) produce different transcripts of identical speech and judge the same ASR output differently. This paper argues that reference monism - enforcing a single transcription convention as ground truth - commits epistemic injustice. Speakers with aphasia, whose speech includes clinically meaningful disfluencies, are systematically disadvantaged when evaluated against "clean" references that treat those disfluencies as errors. The harm is not merely differential performance, but that evaluative infrastructure lacks interpretive resources to recognize their contributions as legitimate. We develop a philosophical framework introducing the hermeneutical gap, formalize Epistemic Injustice Distance (EID) to measure reference monism's cost, and demonstrate empirically using AphasiaBank that WER varies depending on which convention defines ground truth. We propose WER-Range: reporting performance across legitimate conventions rather than assuming a single correct answer.
☆ Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose \textbf{S}elf-\textbf{Co}nsolidating \textbf{L}anguage Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.
comment: 9 pages
☆ WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems
The LLM Wiki pattern, to compile and provide domain knowledge into a persistent artifact and serve it to LLMs via KV cache inference, promises context access at sub-second latency with zero retrieval failure. Realizing this requires solving the compilation gap: LLM compilation distilling raw documents into a wiki without catastrophically discarding critical facts. We characterize this gap across 17 RepLiQA domains (6,800 questions): we observe that full context KV cache inference outperforms RAG on curated knowledge (4.38 vs. 4.08 out of 5, 7.3 faster TTFT) but degrades below RAG at scale due to attention dilution, and blind compilation fails entirely (2.14 to 2.32 vs. 3.46, 53 to 60% catastrophic failure rate). To address the compilation gap, we propose WiCER (Wiki-memory Compile, Evaluate, Refine), an iterative algorithm inspired by counterexample-guided abstraction refinement (CEGAR) that closes this gap. WiCER evaluates compiled wikis against diagnostic probes, identifies dropped facts, and forces their preservation in subsequent compilations. One to two iterations recover 80% of lost quality (mean 3.24 vs. 3.47 for raw full-context across the 15 topics with baselines), reducing catastrophic failures by 55% relative. An ablation across all 17 topics confirms that targeted diagnosis (+0.95), not generic pinning (+0.16), drives the gains. All code and benchmarks are released for reproducible research.
☆ MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
Real-world clinical diagnosis is a complex process in which the doctor is required to obtain information from both interaction with the patient and conducting medical exams. Additionally, the doctor needs to adapt to different patient personas, as well as noisy and incomplete information that can happen at any time during the process. However, existing benchmarks for medical LLMs and methods for automatic diagnosis largely simplify this process by reducing it to single-turn question answering, noise-free conversations, or sequential exam making, etc., ignoring the interactive and uncertain nature of clinical diagnosis. In this paper, we aim to address this gap by formalizing clinical diagnosis as a Partially Observable Markov Decision Process (POMDP) with three action types: questioning the patient, ordering medical exams as tool calls, and issuing a diagnosis. We also introduce a systematic noise model comprising seven patient noise types and three exam noise types. Using our proposed environment, we train an effective diagnosis agent, \textbf{MedExAgent}, through a two-stage pipeline that first performs supervised finetuning on synthetic conversations structured after the Calgary-Cambridge model for clinical interviews, and then applies DAPO to optimize a composite reward capturing diagnostic accuracy, tool call quality, and exam cost including financial cost and patient discomfort. Through extensive experiments and ablation studies, we demonstrate that MedExAgent achieves diagnostic performance comparable to larger models while maintaining cost-efficient examination strategies.
☆ GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
☆ NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
Large Language Models (LLMs) have shown good performance on various science educational benchmarks, demonstrating their potential for use in science and mathematics education. Yet, LLMs tend to be evaluated on science and mathematical educational datasets from the Western world, with an underrepresentation of datasets from the Global South. Furthermore, they tend to have multiple-choice answer options that are trivial to evaluate. In this work, we present NSMQ Riddles, a novel benchmark of Scientific and Mathematical Riddles from Ghana's National Science and Maths Quiz (NSMQ) competition to evaluate LLMs. The NSMQ is an annual live TV competition for senior secondary school students in Ghana that brings together the smartest high school students in Ghana who compete in teams of 2 by answering questions in biology, chemistry, physics, and math over five rounds and five stages until a winning team is crowned for that year. NSMQ Riddles consists of 11 years of riddle questions (n=1.8K) from the 5th round, with each riddle containing a minimum of 3 clues. Students compete to be the first to guess the answer on any of the clues, with earlier clues being vague and also fetching more points. The answers are usually a number, word, or short phrase, allowing for automatic evaluation. We evaluated state-of-the-art models: closed (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) and open models (Kimi-K2.5, DeepSeek-V3.1, GPT-OSS-120B) with high and low reasoning settings. Our evaluation shows that the dataset is challenging even for state-of-the-art LLMs, which performed worse than the best student contestants. This work contributes a novel and challenging benchmark for scientific and mathematical reasoning from the Global South towards enabling a true global benchmarking of LLMs' capabilities for science and mathematics education.
comment: 15 pages. Accepted at the 27th International Conference on Artificial Intelligence in Education
♻ ☆ Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at α = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.
♻ ☆ How Do Language Models Compose Functions?
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap", i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. We then decode residual stream representations and identify two processing mechanisms: one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that embedding space geometry is strongly related to which mechanism is employed, where the idiomatic mechanism is dominant when tasks are represented by translations from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions.
♻ ☆ Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.
♻ ☆ Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning
The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.
♻ ☆ Searching for Privacy Risks in LLM Agents via Simulation ICLR 2026
The widespread deployment of LLM-based agents is likely to introduce a critical privacy threat: malicious agents that proactively engage others in multi-turn interactions to extract sensitive information. However, the evolving nature of such dynamic dialogues makes it challenging to anticipate emerging vulnerabilities and design effective defenses. To tackle this problem, we present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. Specifically, we employ LLMs as optimizers to analyze simulation trajectories and iteratively propose new agent instructions. To explore the strategy space more efficiently, we further utilize parallel search with multiple threads and cross-thread propagation. Through this process, we find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery, while defenses evolve from simple rule-based constraints to robust identity-verification state machines. The discovered attacks and defenses generalize across diverse scenarios and backbone models, providing useful insights for developing privacy-aware agents.
comment: ICLR 2026
♻ ☆ TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
Target Safety Assessment (TSA) requires systematic integration of heterogeneous evidence, including genetic, transcriptomic, target homology, pharmacological, and clinical data, to evaluate potential safety liabilities of therapeutic targets. This process is inherently iterative and expert-driven, posing challenges in scalability and reproducibility. We present TSAssistant, a multi-agent framework designed to support TSA report drafting through a modular, section-based, and human-in-the-loop paradigm. The framework decomposes report generation into a coordinated pipeline of specialised subagents, each targeting a single TSA section. Specialised subagents retrieve structured and unstructured data as well as literature evidence from curated biomedical sources through standardised tool interfaces, producing individually citable, evidence-grounded sections. Agent behaviour is governed by a hierarchical instruction architecture comprising system prompts, domain-specific skill modules, and runtime user instructions. A key feature is an interactive refinement loop in which users may manually edit sections, append new information, upload additional sources, or re-invoke agents to revise specific sections, with the system maintaining conversational memory across iterations. TSAssistant is designed to reduce the mechanical burden of evidence synthesis and report drafting, supporting a hybrid model in which agentic AI augments evidence synthesis while toxicologists retain final decision authority.
comment: Updated with self-consistency quantitative evaluation; additional quantitative and expert evaluations to be included in future revisions
♻ ☆ From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems
Compound AI Systems (CAIS) are an emerging paradigm that integrates large language models (LLMs) with external components, including retrievers, agents, tools, and orchestrators, to overcome the limitations of standalone models in tasks requiring memory, reasoning, real-time grounding, and multimodal understanding. These systems enable more capable and context-aware behaviors by composing multiple specialized modules into cohesive workflows. Despite growing adoption in both academia and industry, the CAIS landscape remains fragmented and lacks a unified framework for analysis, taxonomy, and evaluation. In this survey, we define the concept of CAIS, propose a multi-dimensional taxonomy based on component roles and orchestration strategies, and analyze four foundational paradigms: Retrieval-Augmented Generation (RAG), LLM Agents, Multimodal LLMs (MLLMs), and Orchestration. We review representative systems, compare design trade-offs, and summarize evaluation methodologies across these paradigms. Finally, we identify key challenges - including scalability, interoperability, benchmarking, and coordination - and outline promising directions for future research. This survey aims to provide researchers and practitioners with a comprehensive foundation for understanding, developing, and advancing the next generation of system-level artificial intelligence.
♻ ☆ Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
comment: Code is available at https://github.com/Dingzhen230/Heavy-tailed-Noise-in-LLMs
♻ ☆ Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency; aligning to human behavioral models when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions which reveals an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research.
comment: 25 pages, 9 figures, 7 tables
♻ ☆ Rep2Text: Decoding Full Text from a Single LLM Token Representation
Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we investigate a fundamental question: to what extent can the original input text be recovered from a single last-token representation in an LLM? To this end, we propose Rep2Text, a novel framework for decoding text from last-token representations. Rep2Text employs a trainable adapter that maps a target model's last-token representation into the token embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments across various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, etc.) show that, on average, roughly half of the tokens in 16-token sequences can be recovered from this compressed representation while preserving strong semantic coherence. Further analysis reveals a clear information bottleneck effect: as sequence length increases, token-level recovery declines, while semantic information remains relatively well preserved. We also find that scaling effects are less pronounced in inversion tasks. Finally, our framework demonstrates robust generalization to out-of-distribution clinical data.
comment: 18 pages, 6 figures, 6 tables
♻ ☆ Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose \textbf{LoPT}: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
comment: 33pages
♻ ☆ Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression ICML 2026
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
comment: ICML 2026
♻ ☆ Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models
Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.
comment: Revised after peer review. Accepted for publication in Transactions of the Association for Computational Linguistics
♻ ☆ DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference ICLR 26
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22--25\% entropy reduction from easy to medium difficulty regions, suggesting an {overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4\%, establishing a practical path toward compute-efficient reasoning.
comment: ICLR 26
♻ ☆ Training-Free Multimodal Large Language Model Orchestration
Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input--output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.
♻ ☆ Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization ACL 2026
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.
comment: Accepted to ACL 2026. 23 pages, 29 figures, 6 tables
♻ ☆ On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study \emph{constraint-driven online resource allocation for agentic workflows}. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask--model pair, the executor dynamically allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose \emph{Monte Carlo Portfolio Planning} (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget--deadline constraints.
comment: Preprint
♻ ☆ FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
Adapting pretrained models typically involves a trade-off between the high training costs of backpropagation and the heavy inference overhead of memory-based or in-context learning. We propose FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant-time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop-based adaptation while reducing adaptation time by over 90% and is competitive to memory/context-based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource-constrained models. We release the code and models at https://github.com/baoguangsheng/faast.
comment: 9 pages, 6 figures, 10 tables
♻ ☆ OLaPh: Optimal Language Phonemizer
Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. This work introduces OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show that the OLaPh framework significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework's performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework's capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual G2P research.
comment: 11 pages, 1 figure, 4 tables
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards ICML 2026
Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Being intuitively reasonable and practically simple, extensive experiments show that agents trained with ReSeek significantly outperform SOTA baselines in task success rate and path faithfulness.
comment: ICML 2026
♻ ☆ ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning ICML 2026
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4\% in Avg@16 and 7.0\% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
comment: Accepted to ICML 2026. Preprint version. https://github.com/1229095296/ResRL.git
♻ ☆ SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following
In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.
♻ ☆ Sparser, Faster, Lighter Transformer Language Models
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
comment: Code and checkpoints available at: https://github.com/SakanaAI/sparser-faster-llms
♻ ☆ A Geometric Taxonomy of Hallucinations in LLMs
Hallucinations in deployed language models can have real consequences for downstream decisions in domains such as healthcare, legal, and financial services. In production, detection has to run on what the deployed system can see: the query, the response, and often a source document. White-box access to model internals and multi-sample querying are not generally available behind a third-party API. Within this setting - black-box, single-pass, only question/answer available - the dominant baseline is NLI, which returns a value but no diagnosis when it fails. We argue that operating directly on the geometry of the embedding space provides detection methods whose successes and failures are interpretable as structural properties of contrastive sentence-encoder training \citep{wang2020understanding}. The contribution is: given an operationally-motivated taxonomy, geometry predicts which types of hallucination are detectable and which are not - and the predictions hold. We propose three operational types organized by the relation of the response embedding to the plausibility region of grounded responses on the unit hypersphere, and derive from the alignment objective a prediction for each: (1)query-proximate unfaithfulness is detectable by an angular ratio; (2)confabulation outside the plausibility region produces a directional signature that outperforms NLI on expert-annotated error; (3)factual errors sharing vocabulary and frame with correct answers are not separable by angular geometry. To validate on content resembling deployment, we built a 212-pair human-confabulated dataset across nine domains using provoked confabulation.
♻ ☆ User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums
Customer feedback in industrial forums offers rich but underexplored insights into real-world product experience. Yet systematic analysis remains challenging due to unstructured, domain-specific content and the scarcity of high-quality labeled datasets. This paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JSON record contains multi-post comments enriched with metadata and annotated by a large language model (LLM) for UX insights, user expectations, severity ratings, sentiment, and topic classifications. UXPID is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. It supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in technical forums, providing a valuable resource for advancing NLP methods within industrial product support and software engineering domains.
♻ ☆ SpikingBrain: Spiking Brain-inspired Large Models
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
♻ ☆ Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features IEEE
How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by manipulating the corresponding dimensions.
comment: 5 pages, 7 figures, submitted to IEEE Signal Processing Letters
♻ ☆ EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle ICML 2026
Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.
comment: Accepted by ICML 2026
♻ ☆ Structural Generalization on SLOG without Hand-Written Rules
Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves an overall accuracy of $67.3 \pm 0.2\%$ across 10 seeds (AM-Parser: $70.8 \pm 4.3\%$), with 11 of 17 structural generalization categories at $100\%$ type-exact match, including three where AM-Parser scores $0$--$74\%$. Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., $41.4\%$) are mixtures of structurally distinct CCG patterns, not partial generalization. These results suggest that CCG directed types provide higher resolution than SLOG's phenomenon-level categories for characterizing structural generalization, and that the success/failure boundary is determined by the coverage of directed operations in the training data.
♻ ☆ ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2--Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18 000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content, each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.
♻ ☆ Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution
Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics across IOI and Doc-String datasets. Across budget constraints of K in {50, 100, 200, 400, 800}, our rigorous benchmarking reveals distinct operational regimes: while base FAP and adapted baselines perform robustly at relaxed budgets, FAP-Synergy excels in highly constrained, strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 functionally matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation costs scale linearly per feature, Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33%.
♻ ☆ Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, for accumulating supporting evidence across turns. Meanwhile, with reinforcement learning (RL), training these agents rely on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines that often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.
comment: Preprint
♻ ☆ WorldCup Sampling for Multi-bit LLM Watermarking
As large language models (LLMs) generate increasingly human-like text, watermarking has emerged as a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing approaches typically extend zero-bit watermarking schemes by introducing static logit perturbations and counting-based decoding strategies, which can degrade text quality and compromise decoding robustness as the payload increases. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that models the sampling process as a structured communication channel and embeds message bits through a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup incorporates entropy-aware modulation to preserve generation quality and enables robust message recovery via confidence-aware decoding that accounts for token-level reliability. Comprehensive experiments demonstrate that WorldCup achieves a strong balance across message capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines. We believe that this work establishes a scalable and principled foundation for future research on multi-bit watermarking in LLMs.
♻ ☆ MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning ACL 2026
LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
comment: Accepted to ACL 2026
♻ ☆ Rethinking Weight Tying: Pseudo-Inverse Tying for LM Stable Training and Updates
Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, parameter sharing alone does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and weakening explainability probes that rely on a meaningful vocabulary-space decoder. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by polar initialization from a source checkpoint for continued pretraining or by random orthonormal initialization for from-scratch pretraining, and introduces a learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and vocabulary-sized auxiliary parameters. Beyond improving training stability, PIT provides a cleaner substrate for logit-lens-style and vocabulary-space explainability probes by keeping the input and output token geometries synchronized. We evaluate PIT on on-device models spanning 256M-1.3B parameters. The results show that PIT improves continued-pretraining stability, enforces near-exact token-interface consistency across settings, and yields more predictable lightweight adaptation after continued pretraining, while from-scratch pretraining reveals a trade-off between strict interface consistency and unconstrained optimization.
comment: an early-stage version
♻ ☆ Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
comment: https://github.com/Kwen-Chen/Flexible-Entropy-Control
♻ ☆ SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution
Large Language Model (LLM)-guided evolutionary search is increasingly used for automated algorithm discovery, yet most current methods track search progress primarily through executable programs and scalar fitness. Even when natural-language reasoning is used through heuristic descriptions or reflection, it typically remains transient mutation context or unstructured memory, rather than organized as persistent population-level state over strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce \model, a modular strategy-space layer that turns language-level strategic reasoning into first-class population-level evolutionary state in LLM-driven program search. \model represents each candidate program with an explicit natural-language strategy, clusters the archive by strategy semantics, retrieves behaviorally complementary inspirations, and periodically navigates the strategy landscape to avoid saturated directions. Without modifying the underlying evolutionary algorithms, \model improves existing evolutionary backbones across algorithm discovery, systems optimization, and agent-scaffold design tasks in most settings. Across four systems benchmarks, \model achieves a 20.6% average relative improvement, with the best single run on Prism scoring 3$\times$ higher. These results suggest that persistent strategy representations provide a practical mechanism for improving the effectiveness and cost-efficiency of LLM-guided evolutionary search, pointing toward compound AI systems whose search capabilities benefit from the structured accumulation and reuse of algorithmic strategies.
♻ ☆ NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1-5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.
♻ ☆ NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, with dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
♻ ☆ Detecting Distillation Data from Reasoning Models
Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.
♻ ☆ A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents
In the current field of agent memory, extensive explorations have been conducted in the area of memory retrieval, yet few studies have focused on exploring the memory content. Most research simply stores summarized versions of historical dialogues, as exemplified by methods like A-MEM and MemoryBank. However, when humans form long-term memories, the process involves multi-dimensional and multi-component generation, rather than merely creating simple summaries. The low-quality memory content generated by existing methods can adversely affect recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS will match the most relevant retrieval memory units based on the user's query. Then, the corresponding contextual memory units is obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. We conducted experiments on the LoCoMo dataset and further performed ablation experiments, experiments on the robustness regarding the number of input memories, and overhead experiments, which demonstrated the effectiveness and practical value of our method.
comment: The content has been significantly revised and the author has also changed. Therefore, the paper will be withdrawn for revision and then uploaded after the completion of the modifications
♻ ☆ Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models
Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to \uline{M}ine \uline{in}trinsic mast\uline{er}y (Miner), that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among the other four algorithms, yielding up to \textbf{4.58} absolute gains in Pass@1 and \textbf{6.66} gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models. Code is available at https://github.com/pixas/Miner.
comment: 24 pages
♻ ☆ FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
Large language models (LLMs) are increasingly deployed in financial research workflows, where their role is evolving from single-model assistance for human analysts toward autonomous collaboration among multiple agents. Yet real-world deployments still expose factual errors, numerical inconsistencies, and shallow analysis, which can distort assessments of corporate fundamentals and trigger severe economic losses. While existing benchmarks have begun to evaluate such failures, they score all aspects of the generated analysis in one pass, failing to distinguish whether a model fails at foundational stages like auditing and correction, or underperforms at generating research-grade insights. Consequently, it obscures capability bottlenecks and the specialized strengths essential for multi-agent role assignment. To address these gaps, we introduce FinReasoning, a hierarchical benchmark that decomposes the core capabilities of financial research into semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. FinReasoning reveals clear capability stratification across model types. Closed-source models (like Doubao-Seed-1.8) perform strongly overall and are better suited for core reasoning agents in multi-agent financial systems; open-source general models (like Qwen3-235B) show clear capability divergence and consistently underperform on Semantic Consistency, making them less suited for quality-sensitive generation tasks; financial-domain models (like Fin-R1) generate moderate insights but lack foundational auditing skills. Our work has already been deployed in pilot tests across several real-world scenarios. The resource is available at https://github.com/TongjiFinLab/FinReasoning.
♻ ☆ Overview of the TREC 2025 RAGTIME Track
The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.
comment: 14 pages, 3 figures, final version of the RAGTIME 2025 overview paper
♻ ☆ Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala SC
The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.
comment: Published at SCSE 2026 (9th IEEE International Research Conference on Smart Computing and Systems Engineering). Best Paper Award - Text Analytics Track
♻ ☆ OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.
♻ ☆ A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs
Autoregressive (AR) language models build representations incrementally via left-to-right prediction, while diffusion language models (dLLMs) are trained through full-sequence denoising. Although recent dLLMs match AR performance, whether diffusion objectives fundamentally reshape internal representations remains unclear. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), using cosine similarity across layers and tokens alongside static inference-time layer-skipping as an analytical probe of redundancy. We find that diffusion objectives produce more global representations with substantial early-layer redundancy and reduced recency bias, while AR objectives yield tightly coupled, locally structured representations. AR-initialized dLLMs retain AR-like dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this redundancy, native dLLMs absorb up to 18.75% FLOPs reduction while retaining over 90% performance on math-reasoning and coding benchmarks, whereas AR models collapse under identical skipping, revealing that diffusion objectives, rather than architecture alone, induce depth redundancy that enables principled compression.
comment: v3: improving writing with all v2 changes
♻ ☆ Beyond Factual Accuracy: Evaluating Global Reasoning Integrity in RAG Systems with LogicScore
Current evaluation methods for Retrieval Augmented Generation (RAG) suffer from \textit{factual myopia}: they relentlessly emphasize factual accuracy yet neglect global logical integrity in long-form answer generation. This drives models to force unnatural connections, producing factually grounded yet logically incoherent responses with unaddressed gaps, ambiguous links, or redundant premises. To mitigate this, we present \textsc{LogicScore}, shifting from local, fact-by-fact assessment to rigorous global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: \textit{Completeness} (logically sound deduction), \textit{Essentiality} (non-redundancy), and \textit{Determinateness} (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high factual accuracy (e.g., 92.85\% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11\% Essentiality for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development.
♻ ☆ Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks ICML 2026
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.
comment: Accepted to ICML 2026
♻ ☆ TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis IJCAI 2026
Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) needs to capture the complex interrelationships in multiple rounds of dialogues. Existing methods usually employ simple Graph Convolutional Networks (GCN), which introduce structural noise and fail to consider the temporal sequence of the dialogues, or use standard RoPE, which implicitly captures relative distances in a flat sequence but cannot clearly separate the token-level syntactic order from the utterance-level progression, and may suffer from the Distance Dilution problem. To address these issues, we propose a new framework that combines Thread-Constrained Directed Acyclic Graph (TC-DAG) and Discourse-Aware Rotary Position Embedding (D-RoPE). Specifically, TC-DAG filters out cross-thread noise based on thread constraints, maintains global connectivity through root anchoring, and incorporates the temporal sequence of the dialogues. D-RoPE aligns multi-layer semantics using dual-stream projection and multi-scale frequency signals, captures thread dependencies using tree-like distances, and alleviates the token-level Distance Dilution problem by incorporating utterance-level progressions. Experimental results on two benchmark datasets demonstrate that our framework achieves state-of-the-art performance.
comment: Accepted to IJCAI 2026 (Main Track)
♻ ☆ Belief Memory: Agent Memory Under Partial Observability
LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API~X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.
♻ ☆ FiSMiness: A Finite State Machine Based Paradigm for Emotional Support Conversations NAACL2025
Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have obtained remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called FiSMiness. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy and the final response upon each conversational turn. Substantial experiments on ESC datasets suggest that FiSMiness outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and external-assisted methods, even those with many more parameters.
comment: NAACL2025 CMCL Workshop
♻ ☆ Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.
comment: Work in progress
♻ ☆ Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation ICML 2026
The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
comment: ICML 2026
♻ ☆ OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model strengths and limitations, characterize failure patterns, and provide recommendations for improvement. This data resource will facilitate the development of next-generation artificial intelligence systems aligned with clinical cognition, safety requirements, and workflow complexity in dental practice.
comment: 21 pages, 4 figures, 5 tables
♻ ☆ InvThink: Premortem Reasoning for Safer Language Models
We present InvThink, a training and prompting framework that requires the model to enumerate, analyze, and constrain potential failures before generating its final response. Unlike existing safety alignment methods that optimize only for safe final responses, InvThink structures generation into three steps: (1) enumerate potential harms, (2) analyze their consequences, (3) generate the response under explicit mitigation constraints. We observe three findings: (i) InvThink shows higher safety scores at larger model sizes, compared to existing safety prompting and alignment baselines. (ii) InvThink mitigates the safety tax. Models trained with INVTHINK preserve their reasoning capability on standard benchmarks. (iii) beyond general safety tasks, InvThink also reduces harmful behavior in professional ethics domains (medicine, finance, law) and in agentic misalignment scenarios, achieving up to 32% reduction in harmfulness over zero-shot baselines and 16% over SafetyPrompt. We extend InvThink with supervised fine-tuning, and GRPO-based reinforcement learning across three LLM families.
♻ ☆ Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs
Moral benchmarks for LLMs typically score models on context-free prompts, implicitly treating the measured choice rate as stable. We test this assumption with a direction-flipped influence audit: for each scenario, we compare a baseline prompt with matched cues steering toward option A or option B. Across a trolley-problem-style moral triage task, BBQ, and DailyDilemmas, and across five LLM families with and without reasoning, short contextual cues shift per-condition choice rates by 12-18 percentage points on average. These shifts reveal structure that baseline scores miss: roughly 40% of baseline-neutral triage and BBQ conditions exhibit directional asymmetry under influence, and a meaningful share of significant effects backfire, moving opposite the cue's intended direction. In follow-up probes, models often recognize the cue while denying that it affected their choice. Among significant backfire trials, this stated-vs.-revealed inconsistency appears in 78% of cases. Reasoning does not eliminate contextual sensitivity but reshapes it: social-pressure cues such as user preference and emotional appeal weaken across benchmarks, while few-shot demonstrations strengthen sharply on both triage and BBQ. We recommend direction-flipped influence pairs as a standard complement to context-free moral-bias evaluation, and release the harness and data to make such audits routine.
Machine Learning 300
☆ Normalizing Trajectory Models
Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
☆ Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses that are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We shall report here the results of a proof-of-concept implementation to decode imagined speech, where all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can directly be made applicable to realistic brain-computer interface scenarios.
☆ GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs
Conformal prediction (CP) provides a distribution-free approach to uncertainty quantification with finite-sample guarantees. However, applying CP to graph neural networks (GNNs) remains challenging as the combinatorial nature of graphs often leads to insufficiently certain predictions and indiscriminative embeddings. Existing methods primarily rely on embedding-space proximity for localization, which can be unreliable for graphs and yield inefficient prediction sets. We propose GRAPHLCP, a proximity-based localized CP framework that explicitly incorporates graph topology and inter-node dependencies into localization and weighting. Our approach introduces a feature-aware densification step to mitigate locality bias in sparse graphs, followed by a Personalized PageRank-based kernel computation to model structural proximity. This enables topology-dependent anchor sampling and calibration weighting that captures both local and long-range dependencies. Extensive experiments on several regression and classification datasets demonstrate that GRAPHLCP guarantees marginal coverage with finite samples while efficiently attaining favorable test conditional coverage across various conditioning scenarios.
comment: 20 pages, 9 Figures, 8 Tables
☆ A Note on Non-Negative $L_1$-Approximating Polynomials
$L_1$-Approximating polynomials, i.e., polynomials that approximate indicator functions in $L_1$-norm under certain distributions, are widely used in computational learning theory. We study the existence of \textit{non-negative} $L_1$-approximating polynomials with respect to Gaussian distributions. This is a stronger requirement than $L_1$-approximation but weaker than sandwiching polynomials (which themselves have many applications). These non-negative approximating polynomials have recently found uses in smoothed learning from positive-only examples. In this short note, we prove that every class of sets with Gaussian surface area (GSA) at most $Γ$ under the standard Gaussian admits degree-$k$ non-negative polynomials that $\eps$-approximate its indicator functions in $L_1$-norm, for $k=\tilde{O}(Γ^2/\varepsilon^2)$. Equivalently, finite GSA implies $L_1$-approximation with the stronger pointwise guarantee that the approximating polynomial has range contained in $[0,\infty)$. Up to a constant-factor, this matches the degree of the best currently known Gaussian $L_1$-approximation degree bound without the non-negativity constraint.
☆ Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning--style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
☆ Fast Byte Latent Transformer
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
☆ Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett--Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
☆ Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids
We introduce CUTS-GPR, a new method for performing numerically exact Gaussian process regression (GPR) in high-dimensional settings. The key component of CUTS-GPR is an extremely fast kernel matrix-vector product, which exhibits near-linear or even linear scaling with the amount of training data, $N$, and low-order polynomial scaling with dimensionality, $D$. This is obtained by combining an additive kernel with an incomplete grid and exploiting the resulting structure of the kernel matrix. We demonstrate the scalability of the matrix-vector product by running benchmarks with billions of data points and thousands of dimensions. Full GPR calculations, including hyperparameter optimization, are completed in a matter of hours for $N = 447 265$ and $D = 24$. We demonstrate that our CUTS-GPR enables Bayesian modeling of high-dimensional potential energy surfaces - a longstanding challenge in computational chemistry.
comment: 51 pages, 8 figures
☆ PropSplat: Map-Free RF Field Reconstruction via 3D Gaussian Propagation Splatting IEEE
Building a site-specific propagation model typically requires either ray-tracing over detailed 3D maps or dense measurement campaigns. Both approaches are expensive and often infeasible for rapid deployments where geographic data is unavailable or outdated. We present PropSplat, a map-free propagation modeling method that reconstructs radio frequency (RF) fields using 3D anisotropic Gaussian primitives. Each Gaussian encodes a scalar path loss offset relative to an explicit baseline path loss model with a learnable path loss exponent. Gaussians are initialized along observed transmitter--receiver paths and optimized end-to-end to learn the propagation environment without external information like floor plans, terrain databases, or clutter data. We evaluate PropSplat against wireless radiance field methods NeRF$^2$, GSRF, and WRF-GS+ on two real-world datasets. On large-scale outdoor drive-tests spanning multiple topographical regions at six sub-6 GHz frequencies, PropSplat achieves 5.38 dB RMSE when training measurements are spaced 300m apart and outperforms WRF-GS+ (5.87 dB), GSRF (7.46 dB), and NeRF$^2$ (14.76 dB). On indoor Bluetooth Low Energy measurements, PropSplat achieves 0.19m mean localization error, an order of magnitude better than NeRF$^2$ (1.84m), while achieving near-identical received signal strength prediction accuracy. These results show that accurate site-specific propagation reconstruction is achievable from sparse RF-native measurements. The need for geographic data as a prerequisite for scalable RF environment modeling is reduced.
comment: Accepted for presentation at IEEE DySPAN 2026
☆ Semiparametric Efficient Test for Interpretable Distributional Treatment Effects
Distributional treatment effects can be invisible to means: a treatment may preserve average outcomes while changing tails, modes, dispersion, or rare-event probabilities. Kernel tests can detect discrepancies between interventional outcome laws, but global tests do not reveal where the laws differ. We propose DR-ME, to our knowledge the first semiparametrically efficient finite-location test for interpretable distributional treatment effects. DR-ME evaluates an interventional kernel witness at learned outcome locations, returning causal-discrepancy coordinates rather than only a global rejection. From observational data, we derive orthogonal doubly robust kernel features whose centered oracle form is the canonical gradient of this finite witness. For fixed locations, we characterize the local testing limit: DR-ME is chi-square calibrated under the null, has noncentral chi-square local power, and uses the covariance whitening that optimizes local signal-to-noise for discrepancies visible through the selected coordinates. This efficient local-power geometry yields a principled location-learning criterion, with sample splitting preserving post-selection validity. Experiments show near-nominal type-I error, competitive power against global doubly robust kernel tests, and interpretable learned locations that localize distributional effects in a semi-synthetic medical-imaging study.
☆ PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction
Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.
☆ STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.
comment: 19 pages, 9 figures
☆ Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data
Traffic state estimation from sparse fixed sensors is challenging because physics-informed neural networks (PINNs) tend to over-smooth the shockwaves admitted by the Lighthill-Whitham-Richards (LWR) model. This study proposes Adaptive Domain Decomposition Physics-Informed Neural Networks (ADD-PINN), a two-stage residual-guided framework for LWR-based offline speed-field reconstruction. A coarse global PINN is first trained; its spatial residual profile is then used to place subdomain boundaries and initialize child subnetworks in a decomposition-enabled mode, while a data-driven shock indicator can retain a single-domain fallback when localized evidence of transition is weak. The primary offline I-24 MOTION evaluation spans five days, five sensor configurations, and ten seeds per configuration, yielding 1,500 runs in total. Against neural and physics-informed baselines, ADD-PINN attains the lowest relative L2 error in 18 of 25 configurations and in 14 of 15 sparse-sensing cases, while training 2.4 times faster than the extended PINN (XPINN) baseline. An ablation study supports spatial-only decomposition as an effective default for fixed-sensor traffic reconstruction in the evaluated settings. Supplementary Next Generation Simulation (NGSIM) experiments serve as a negative control: the shock indicator suppresses decomposition in all 50 runs, and the default single-domain fallback ranks first across all sensor configurations. These results support residual-guided spatial decomposition as an effective PINN-family design for offline reconstruction when sparse fixed sensing coincides with localized transition regions.
comment: 56 pages, 5 figures, 12 tables. Submitted to Transportation Research Part C
☆ Globally Optimal Training of Spiking Neural Networks via Parameter Reconstruction
Spiking Neural Networks (SNNs) have been proposed as biologically plausible and energy-efficient alternatives to conventional Artificial Neural Networks (ANNs). However, the training of SNN usually relies on surrogate gradients due to the non-differentiability of the spike function, introducing approximation errors that accumulate across layers. To address this challenge, we extend the work on convexification of parallel feedforward threshold networks to parallel recurrent threshold networks, which subsume parallel SNNs as a structured special case. Building on this theoretical framework, we propose a parameter reconstruction algorithm for SNN training that demonstrates consistent and significant advantages across various tasks, both as a standalone method and in combination with surrogate-gradient training. The ablations further demonstrate the data scalability and robustness to model configurations of our training algorithm, pointing toward its potential in large-scale SNN training.
☆ Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims NeurIPS 2026
Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit on $n=30$ reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least one, and explain how conclusions shift if assumptions fail. Validation is not identification.
comment: 10 pages, 2 figures. Submitted to NeurIPS 2026 (Position Track)
☆ Interpreting Reinforcement Learning Agents with Susceptibilities
Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.
comment: 55 pages, comments welcome
☆ Penalty-Based First-Order Methods for Bilevel Optimization with Minimax and Constrained Lower-Level Problems
We study a class of bilevel optimization problems in which both the upper- and lower-level problems have minimax structures. This setting captures a broad range of emerging applications. Despite the extensive literature on bilevel optimization and minimax optimization separately, existing methods mainly focus on bilevel optimization with lower-level minimization problems, often under strong convexity assumptions, and are not directly applicable to the minimax lower-level setting considered here. To address this gap, we develop penalty-based first-order methods for bilevel minimax optimization without requiring strong convexity of the lower-level problem. In the deterministic setting, we establish that the proposed method finds an $ε$-KKT point with $\tilde{O}(ε^{-4})$ oracle complexity. We further show that bilevel problems with convex constrained lower-level minimization can be reformulated as special cases of our framework via Lagrangian duality, leading to an $\tilde{O}(ε^{-4})$ complexity bound that improves upon the existing $\tilde{O}(ε^{-7})$ result. Finally, we extend our approach to the stochastic setting, where only stochastic gradient oracles are available, and prove that the proposed stochastic method finds a nearly $ε$-KKT point with $\tilde{O}(ε^{-9})$ oracle complexity.
☆ STEPS: A Temporal Smooth Error Propagation Solver on the Manifolds for Test-Time Adaptation in Time Series Forecasting NeurIPS 2026
Test-Time Adaptation (TTA) aims to improve time series forecasting under distribution shifts by using limited observations revealed during inference. However, forecasting TTA must operate in a source-free online setting, where the adaptation signal is short, temporally correlated, and potentially noisy. Existing methods can therefore suffer from weak identifiability, error accumulation, and unstable long-horizon corrections when the revealed prefix is sparse or contaminated. To address these issues, we propose STEPS, a Smooth Temporal Error Propagation Solver for TTA in time-series forecasting. STEPS reformulates forecasting TTA as a Dirichlet Boundary Value Problem on a temporal manifold, where the revealed prefix error serves as the boundary condition for the unknown future error field. Then, STEPS solves a smooth and bounded correction field in prediction space: a Local Solver propagates prefix errors under temporal smoothness, a Global Solver retrieves stable cross-window error memory and Spatiotemporal Manifold Fusion (SMF) integrates both solutions into the final correction. Across six standard benchmarks and four frozen backbones, STEPS achieves an average relative MSE reduction of 26.82% over the zero-shot backbone, exceeding the strongest compared TTA baseline by 12.77%. Additional sparse prefix and contamination tests confirm the robustness of STEPS under limited and noisy prefixes.
comment: 9 pages main text, appendix included. 7 figures. Submitted to NeurIPS 2026
☆ Graph-Structured Hyperdimensional Computing for Data-Efficient and Explainable Process-Structure-Property Prediction
Multiphoton photoreduction enables high-fidelity fabrication of complex 3D microstructures, yet reliable process-structure-property (PSP) prediction remains difficult because the available data are sparse, heterogeneous, and interaction-dominated. In this regime, conventional feature-vector models are statistically underdetermined, making them prone to spurious correlations, poor regime transfer, and unstable post hoc explanations, whereas mechanistic pipelines depend on calibrated submodels that are rarely available during early process development. We present PSP-HDC, a graph-structured hyperdimensional computing framework that encodes a directed PSP graph as an internal prior for representation, inference, and explanation. A trainable scalar-to-hypervector encoder learns parameter-specific embeddings on a fixed hyperdimensional basis to accommodate heterogeneous scales and noise. Sample representations are then composed through graph-aligned binding and bundling along directed PSP dependencies, and prediction is performed by associative-memory retrieval against class prototypes. Because the same prototype memories support both decision making and attribution, PSP-HDC provides intrinsic explanations at the parameter, group, and within-group levels, while memory alignment and separation quantify prototype formation during training. On sheet-resistance regime prediction for the 3D platform, PSP-HDC achieves an accuracy of 0.910 +/- 0.077 over 1000 random splits and 0.896 under process-fold generalization, outperforming strong baselines.
comment: 19 pages, 18 figures
☆ Bayesian Sensitivity of Causal Inference Estimators under Evidence-Based Priors
Causal inference, especially in observational studies, relies on untestable assumptions about the true data-generating process. Sensitivity analysis helps us determine how robust our conclusions are when we alter these underlying assumptions. Existing frameworks for sensitivity analysis are concerned with worst-case changes in assumptions. In this work, we argue that using such pessimistic criteria can often become uninformative or lead to conclusions contradicting our prior knowledge about the world. To demonstrate this claim, we generalize the recent s-value framework (Gupta & Rothenhäusler, 2023) to estimate the sensitivity of three different common assumptions in causal inference. Empirically, we find that, indeed, worst-case conclusions about sensitivity can rely on unrealistic changes in the data-generating process. To overcome this, we extend the s-value framework with a new sensitivity analysis criterion: Bayesian Sensitivity Value (BSV), which computes the expected sensitivity of an estimate to assumption violations under priors constructed from real-world evidence. We use Monte Carlo approximations to estimate this quantity and illustrate its applicability in an observational study on the effect of diabetes treatments on weight loss.
comment: TMLR 2026
☆ Tool Calling is Linearly Readable and Steerable in Language Models
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $τ$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
comment: 29 pages, 6 figures, 7 tables. Manuscript under review
☆ Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. We localize the Gemma-3-27B handoff to five attention heads through two-stage path patching that recover ~90% of the rhyme-routing capacity at the newline.
comment: 13 pages, 20 figures, 3 tables
☆ Susceptibilities and Patterning: A Primer on Linear Response in Bayesian Learning
These notes introduce the theory of susceptibilities as developed in [arXiv:2504.18274, arXiv:2601.12703] for interpreting neural networks. The susceptibility of an observable $φ$ to a data perturbation is defined as a derivative of a posterior expectation, which by the fluctuation--dissipation theorem equals a posterior covariance. Different choices of $φ$ yield different objects: per-sample losses give the influence matrix (the Bayesian influence function of [arXiv:2509.26544]), while component-localized observables give the structural susceptibility matrix that pairs model components with data patterns. The susceptibility matrix is (up to a factor of $nβ$) the Jacobian of the map from data distributions to structural coordinates; its pseudo-inverse provides a linearized solution to the patterning problem of [arXiv:2601.13548]: finding data perturbations that produce a desired structural change. We motivate the theory from its statistical-mechanical foundations, then give a detailed exposition of susceptibilities, their empirical estimators, and their connection to the geometry of the loss landscape.
comment: 34 pages, 3 figures, comments welcome!
☆ Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
Recent works have advanced feedback-based learning systems, whereby a foundation model is able to intake incoming feedback (e.g., a user) to self-improve, creating a self-loop system of training. However, existing works are limited in needing to consider an offline setup to allow for such feedback-based methods, and are further limited in the need of requiring privileged ground-truth contexts for training. Moreover, there is limited consideration of federated learning (FL), which is particularly well-suited for incorporating external feedback across large networks of end users, for example, but requires methods to be efficient for training on resource-constrained edge devices. Therefore, we introduce SPEAR (Self-Play Enhancement via Advantage-Weighted Refinement), an efficient online learning algorithm for federated LLM fine-tuning. SPEAR utilizes a feedback-guided self-play loop to construct naturally contrastive pairs per prompt which are utilized to be trained on (i) standard maximum likelihood on correct completions and (ii) confidence-weighted unlikelihood on tail tokens of incorrect completions. Without the need of expensive group generations and ground-truth contexts for training (i.e., only partial, non-answer feedback), in contrast with existing works, SPEAR can be trained both online and in a resource-efficient manner. We validate SPEAR across various benchmark datasets, demonstrating its superior performance in comparison to state-of-the-art baselines. The implementation code is publicly available at https://github.com/lee3296/SPEAR.
comment: 27 pages
☆ It Just Takes Two: Scaling Amortized Inference to Large Sets
Neural posterior estimation has emerged as a powerful tool for amortized inference, with growing adoption across scientific and applied domains. In many of these applications, the conditioning variable is a set of observations whose elements depend not only on the target but also on unknown factors shared across the set. Optimal inference therefore requires treating the set jointly, which in turn requires training the estimator at the deployment set size -- a regime where memory and compute quickly become prohibitive. We introduce a simple, theoretically grounded strategy that decouples representation learning from posterior modeling. Our method trains a mean-pool Deep Set on sets of size at most two, producing an encoder that generalizes to arbitrary set sizes. The inference head is then finetuned on pre-aggregated embeddings, making training cost essentially independent of the deployment set size N. Across scalar, image, multi-view 3D, molecular, and high-dimensional conditional generation benchmarks with N in the thousands, our approach matches or outperforms standard baselines at a fraction of the compute.
☆ DVD: Discrete Voxel Diffusion for 3D Generation and Editing
We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations.
☆ Linear Response Estimators for Singular Statistical Models
We define susceptibilities as a measure of the response of an observable quantity of a parameterized statistical model to a perturbation of the data for a general class of observables. We define estimators for these susceptibilities as statistics in a sequence of n data-points and prove that these estimators are consistent and asymptotically unbiased in the large n regime.
comment: 24 pages, comments welcome!
☆ When Diffusion Model Can Ignore Dimension: An Entropy-Based Theory
Diffusion models perform remarkably well on high-dimensional data such as images, often using only a modest number of reverse-time steps. Despite this practical success, existing convergence theory does not fully explain why such samplers remain efficient in high dimensions. Many prior KL guarantees bound the discretization error in terms of the ambient dimension, while other improved results replace this dependence using intrinsic-dimensional or geometric structure assumptions. In this work, we develop an alternative information-theoretic perspective on diffusion sampler convergence. We prove that, for Gaussian mixture targets, the discretization error is controlled by the Shannon entropy of the latent mixture component rather than by the ambient dimension. Consequently, the leading step complexity scales linearly with latent entropy and depends only logarithmically on the second moment of the data. Our analysis also extends to discrete target distributions, where the relevant complexity is the entropy of the target rather than the dimension of the embedding space. These results suggest that diffusion sampling can remain efficient in high-dimensional spaces when the data distribution admits a compact latent representation, as is widely believed to be the case for natural images.
☆ Asymptotically Log-Optimal Bayes-Assisted Confidence Sequences for Bounded Means
Confidence sequences based on test martingales provide time-uniform uncertainty quantification for the mean of bounded IID observations without parametric distributional assumptions. Their practical efficiency, however, depends strongly on the choice of martingale updates, and many existing constructions do not exploit prior information about plausible data-generating distributions or mean values. We propose a Bayes-assisted framework that uses a Bayesian working predictive model to adaptively construct confidence sequences.For each candidate mean and time point, the predictive distribution selects, among valid one-step martingale factors, the update maximising predictive expected log-growth; validity is therefore preserved even when the prior or working model is misspecified. We prove that if the predictive distribution is Wasserstein-consistent, the resulting procedure is asymptotically log-optimal, matching the per-sample log-growth of an oracle procedure with access to the true distribution. We instantiate the framework using robust predictives based on Dirichlet-process mixtures and Bayesian exponentially tilted empirical likelihood. Experiments on synthetic data, sequential best-arm identification for LLM evaluation, and prediction-powered inference show that informative priors can substantially reduce confidence-sequence width and sampling effort while retaining anytime-valid coverage.
comment: Valentin and Stefano are equal first author
☆ Aggregation in conformal e-classification
Aggregating conformal predictors is a standard way of balancing their predictive and computational efficiency while retaining their validity, at least approximately. An important advantage of conformal e-predictors is that they are easier to aggregate without sacrificing their validity. This paper studies experimentally cross-conformal e-prediction, which is an existing method of aggregating conformal e-predictors, and its modifications that are conceptually simpler and more flexible.
comment: 23 pages, 10 figures
☆ FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning IEEE
Performance evaluation is essential for assessing the quality of machine learning (ML) models and guiding deployment decisions. In federated learning (FL), assessing the performance is challenging because data are distributed across participants. Consequently, the coordinator must rely on locally computed evaluation metrics and aggregate them to assess the global model. A key challenge is that common aggregation strategies, such as weighted averaging based on the local samples per participant, do not always produce the same results as centralized evaluation. Existing definitions of performance evaluation are largely tailored to accuracy and do not generalize to other metrics, leading to inconsistencies between participant-based and centralized evaluation. However, such discrepancies are inconsistent with the FL objective and lead to a wrong calculation of the metric. To address this issue, we examine the underlying reasons for these discrepancies and propose FLAM, a performance evaluation method based on aggregatable measures that yields the same results as centralized evaluation without the need for a global test dataset.
comment: Accepted for publication in 2nd IEEE International Conference on Federated Learning and Intelligent Computing Systems(FLICS2026)
☆ Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs
Federated fine-tuning (FFT) has emerged as a privacy-preserving paradigm for collaboratively adapting large language models (LLMs). Built upon federated learning, FFT enables distributed agents to jointly refine a shared pretrained LLM by aggregating local LLM updates without sharing local raw data. However, FFT-based LLMs remain vulnerable to model manipulation threats, in which adversarial participants upload manipulated LLM updates that corrupt the aggregation process and degrade the performance of the global LLM. In this paper, we propose an Augmented Model maniPulation (AugMP) strategy against FFT-based LLMs. Specifically, we design a novel graph representation learning framework that captures feature correlations among benign LLM updates to guide the generation of malicious updates. To enhance manipulation effectiveness and stealthiness, we develop an iterative manipulation algorithm based on an augmented Lagrangian dual formulation. Through this formulation, malicious updates are optimized to embed adversarial objectives while preserving benign-like parameter characteristics. Experimental results across multiple LLM backbones demonstrate that the AugMP strategy achieves the strongest manipulation performance among all competing baselines, reducing the global LLM accuracy by up to 26% and degrading the average accuracy of local LLM agents by up to 22%. Meanwhile, AugMP maintains high statistical and geometric consistency with benign updates, enabling it to evade conventional distance- and similarity-based defense methods.
☆ Convergent Stochastic Training of Attention and Understanding LoRA
Transformers have revolutionized machine learning and deploying attention layers in the model is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them is trained, to achieve a surprisingly beneficial accuracy-size trade-off. In this work, via a unified framework we rigorously establish trainability of such models under stochastic methods. We prove that for any mild regularization, the empirical regression loss on a attention layer and LoRA on a shallow neural net, both induce Poincaré inequality for the corresponding Gibbs' measure. Then it follows via invoking recent results that a certain SDE, which mimics the SGD, minimizes the corresponding losses. In both the cases, our first-of-its-kind results of trainability on attention and nets, do not rely on any assumptions on the data or the size of the architecture.
☆ Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation
We study Slowly Annealed Langevin Dynamics (SALD), a sampler for tracking a path of moving target distributions and approximating the terminal target through time slowdown. We establish non-asymptotic convergence guarantees via a KL differential inequality, showing that slowdown improves tracking through contraction of intermediate targets and the complexity of the path. Motivated by training-free guided generation with pretrained score-based generative models, we further introduce Velocity-Aware SALD (VA-SALD), which explicitly incorporates the underlying marginal distributions of the pretrained model and uses slowdown to correct the additional deviation induced by guidance. This yields a principled framework for training-free guided generation for diffusion-based and related generative model families, together with convergence guarantees that clarify the roles of intermediate functional inequalities and guidance bias. Code is available at https://github.com/anitan0925/sald.
☆ Exploring the non-convexity in machine learning using quantum-inspired optimization
The escalating complexity of modern machine learning necessitates solving challenging non-convex optimization problems, particularly in high-dimensional regimes and scenarios contaminated by gross outliers. Traditional approaches, relying on convex relaxations or specialized local search heuristics, frequently succumb to suboptimal local minima and fail to recover the true underlying discrete structures. In this paper, we propose treating these non-convex challenges as a global search problem and introduce a unified framework based on Quantum-Inspired Evolutionary Optimization (QIEO). By leveraging a probabilistic representation inspired by quantum superposition, QIEO maintains a global view of the search space, enabling it to tunnel through local optima that trap conventional gradient-based and greedy solvers. We comprehensively evaluate QIEO across diverse non-convex applications, including sparse signal recovery (gene expression analysis and compressed sensing) and robust linear regression. Extensive benchmarking against state-of-the-art continuous solvers (ADAM, Differential Evolution), classical metaheuristics (Genetic Algorithms), and specialized non-convex algorithms (Iterative Hard Thresholding) demonstrates that QIEO consistently achieves superior structural fidelity, lower mean squared error, and enhanced robustness without support inflation. Our findings suggest that embracing a quantum-inspired global search provides a resilient, unified paradigm for overcoming the inherent intractability of discrete nonconvex machine learning landscapes.
☆ TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $π_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.
☆ Prototype Guided Post-pretraining for Single-Cell Representation Learning
Single-cell representation learning (SCRL) from gene expression data offers a way to uncover the complex regulatory logic underlying cellular function. Inspired by large language models in natural language modeling, several single-cell pretrained models have recently been proposed that treat genes as tokens and cells as sentences. However, these models are fundamentally limited by the long-tailed nature of cell-type distributions and struggle to generalize under covariate shifts in gene expression data. While fine-tuning is often used to mitigate these issues, we observe that performance remains bounded. To address this challenge, we introduce CellRefine, a post-pretraining method that operates between the pretraining and fine-tuning stages of a single-cell foundation model. CellRefine uses a multi-faceted objective that incorporates marker-gene sets as structural priors to guide post-pretraining and refine the latent embedding manifold of cells. Across multiple computational biology tasks, empirical results show that CellRefine consistently improves downstream performance, yielding gains up to 15%.
☆ INO-SGD: Addressing Utility Imbalance under Individualized Differential Privacy ICLR-26
Differential privacy (DP) is widely employed in machine learning to protect confidential or sensitive training data from being revealed. As data owners gain greater control over their data due to personal data ownership, they are more likely to set their own privacy requirements, necessitating individualized DP (IDP) to fulfil such requests. In particular, owners of data from more sensitive subsets, such as positive cases of stigmatized diseases, likely set stronger privacy requirements, as leakage of such data could incur more serious societal impact. However, existing IDP algorithms induce a critical utility imbalance problem: Data from owners with stronger privacy requirements may be severely underrepresented in the trained model, resulting in poorer performance on similar data from subsequent users during deployment. In this paper, we analyze this problem and propose the INO-SGD algorithm, which strategically down-weights data within each batch to improve performance on the more private data across all iterations. Notably, our algorithm is specially designed to satisfy IDP, while existing techniques addressing utility imbalance neither satisfy IDP nor can be easily adapted to do so. Lastly, we demonstrate the empirical feasibility of our approach.
comment: Accepted to the 14th International Conference on Learning Representations (ICLR-26)
☆ Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation
Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
☆ Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass the existing SAEs at learning hierarchical pairs while maintaining competitive performance to the state-of-the-art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.
comment: 21 pages
☆ Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning NeurIPS 2026
Sharpness-aware and gradient-alignment methods have been shown to improve generalization, however each family of methods targets a single geometric property of the loss landscape, while ignoring the other. In this paper, we show that this omission is structurally unavoidable and that both flatness and gradient alignment should be considered in multi-distribution learning settings. Specifically, we derive an excess-risk decomposition that yields two additive leading-order terms: (i) an alignment term, controlled by the trace of $\bar{H}^{-1}Σ_g$ and (ii) a curvature term, controlled by $\bar{H}$, where $\bar{H}$ is the average Hessian and $Σ_g$ is the covariance of the gradient across distributions. Notably, $\bar{H}$ appears inverted in one and non-inverted in the other. We further show, via a counterexample, that neither quantity bounds the other in general, so no algorithm targeting only one term can guarantee low excess risk. Motivated by this decomposition, we propose SAGE (Spectral-Aware Gradient-Aligned Exploration) that targets both terms. The curvature component replaces SAM's gradient-scaled perturbation with the polar factor of each layer's gradient matrix, computed via Newton-Schulz iteration, so that the ascent step probes all directions with similar magnitude. On the other hand, the alignment component injects isotropic noise at the descent step, the magnitude of which scales with cross-distribution gradient disagreement. Experiments on five domain-generalization and two multi-task learning benchmarks show that the proposed method establishes a new state-of-the-art on DomainBed and acts as a general-purpose improvement to base MTL solvers, remaining competitive with, or even surpassing, state-of-the-art methods.
comment: Preprint - Submitted to NeurIPS 2026
☆ Statistical inference with belief functions: A survey
Belief functions are a powerful and popular framework for the mathematical characterisation of uncertainty, in particular in situations in which lack of data renders learning a probability distribution for the problem impractical. The first step in a reasoning chain based on belief functions is inference: how to learn a belief measure from the available data. In this survey we focus, in particular, on making inference from statistical data, and review the most significant contributions in the area.
comment: 9 pages, 0 figures
☆ Consistency Regularised Gradient Flows for Inverse Problems
Vision-Language Latent Diffusion Models (LDMs) (Rombach et al., 2022) provide powerful generative priors for inverse problems. However, existing LDM-based inverse solvers typically require a large number of neural function evaluations (NFEs) and backpropagation through large pretrained components, leading to substantial computational costs and, in some cases, degraded reconstruction quality. We propose a unified Euclidean-Wasserstein-2 gradient-flow framework that jointly performs posterior sampling and prompt optimization in the latent space through a single flow that aligns the prior and posterior with the observed data. Combined with few-step latent text-to-image models, this formulation enables low-NFE inference without backpropagation through autoencoders. Experiments across several canonical imaging inverse problems show that our method achieves state-of-the-art performance with significantly reduced computational cost.
☆ Curvature Beyond Positivity: Greedy Guarantees for Arbitrary Submodular Functions
Submodular functions -- functions exhibiting diminishing returns -- are central to machine learning. When the objective is monotone and non-negative, the greedy algorithm achieves a tight $63\%$ approximation. But many practical objectives incorporate costs that make them negative on some inputs, and all existing multiplicative guarantees require non-negativity. Prior work handles negativity through additive bounds for the special class of decomposable functions and non-monotonicity through partial-monotonicity parameters, but these address each difficulty in isolation and neither extends the classical structural theory. We extend \emph{curvature} -- a parameter measuring how far a function deviates from linearity -- to all submodular functions, handling both non-monotonicity and negativity through a single classical concept. A greedy algorithm with pruning achieves a curvature-controlled multiplicative ratio for \emph{any} submodular function, including those taking negative values -- the first such guarantee beyond monotonicity and non-negativity. In the non-monotone regime $1 \le c_g < 2.2$, the bound strictly beats the best known uniform ratio of $0.401$ (for non-negative $f$), and it recovers the classical $(1-e^{-c_g})/c_g$ guarantee for monotone functions. A multilinear-extension variant extends the framework to general combinatorial constraints via multilinear relaxation. Experiments on cost-penalized experimental design, coverage, feature selection, and a curvature sweep on Multi-News passage selection support the theory.
comment: 44 pages, 11 figures
☆ Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers
Sparse training reduces the memory and computational costs of deep neural networks. However, sparse optimization methods, e.g., those adding an $\ell_1$ penalty, often control sparsity only indirectly through a regularization parameter $λ$, whose mapping to the final sparsity rate is non-trivial. In our experiments, we found this parameter sensitivity to be particularly pronounced for Bregman-based optimizers. Specifically, the two variants LinBreg and AdaBreg reach the same sparsity at $λ$ values that differ by up to two orders of magnitude, requiring expensive trial-and-error sweeps to achieve a user-specified sparsity. To address this, we propose an adaptive regularization scheme that updates $λ$ based on the difference between the model's current sparsity and the target sparsity. We analyze the resulting algorithm and evaluate it on automatic speaker verification with ECAPA-TDNN and ResNet34 on VoxCeleb and CNCeleb. The proposed method reliably achieves sparsity targets ranging between 75% and 99%. It also converges faster than the oracle-tuned non-adaptive baseline during early training and matches or surpasses its final performance in equal error rate. We further show that the adaptive scheme inherits key properties from its non-adaptive counterpart, including improved out-of-distribution robustness over the dense baselines.
comment: 21 pages, 15 figures
☆ Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability Analysis
Federated Learning (FL) enables decentralised model training across distributed clients without requiring data centralisation. However, the generalisation performance of the global model is usually degraded by data heterogeneity across clients, particularly under limited data availability and class imbalance. To address this challenge, we propose FedQuad, a novel method that explicitly enforces minimising intra-class representations while enabling inter-class splits across clients. By jointly minimising distances between positive pairs and maximising distances between negative pairs, the proposed approach mitigates representation misalignment introduced during model aggregation. We evaluate our method on CIFAR-10, CIFAR-100, and Tiny-ImageNet under diverse non-IID settings and varying numbers of clients, demonstrating consistent improvements over existing baselines. Additionally, we provide a comprehensive analysis of metric learning-based approaches in both centralised and federated environments, highlighting their effectiveness in alleviating representation collapse under heterogeneous data distributions.
comment: arXiv admin note: substantial text overlap with arXiv:2509.04107
☆ Characterizing and Correcting Effective Target Shift in Online Learning
Online learning from a stream of data is a defining feature of intelligence, yet modern machine learning systems often struggle in this setting, especially under distributional shift. To understand its basic properties, we study the relationship between online and offline learning in the context of kernel regression. We derive a closed-form expression for the function learned by online kernel regression, revealing that online kernel regression is equivalent to offline regression with shifted, inaccurate target outputs. Conversely, we show that by compensating for this effective shift in the teaching signal through target correction, online kernel-based learning can provably learn the same predictor as its offline counterpart. We derive both a closed-form expression for this target correction and an iterative form that can be applied sequentially. Applying this framework to image classification tasks on CIFAR-10 and CORe50, we show that online stochastic gradient descent with iteratively corrected targets outperforms learning with the true targets in continual learning settings. This work therefore provides a basic framework for analyzing and improving online learning in non-stationary environments.
comment: 22 pages; 6 figures
☆ Black-box model classification under the discriminative factorization
Access to modern generative systems is often restricted to querying an API (the ``black-box" setting) and many properties of the system are unknown to the user at inference time. While recent work has shown that low-dimensional representations of models based on the relationship between their embedded responses to a set of queries are useful for inferring model-level properties, the quality of these representations is highly sensitive to the query set. We introduce the \emph{discriminative factorization} to distinguish between high- and low-quality query sets in the context of black-box model-level classification. Under this framework, the probability of chance-level classification decays exponentially in the query budget. On three auditing tasks, estimated factorization parameters predict the empirical performance decay rate. We conclude by showing that query sets selected using the estimated discriminative field reproduce the empirical ordering of oracle query sets.
☆ KL for a KL: On-Policy Distillation with Control Variate Baseline
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline-canonically a value function -- from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
☆ ADKO: Agentic Decentralized Knowledge Optimization
We present Agentic Decentralized Knowledge Optimization (ADKO), a framework for collaborative black-box optimization across autonomous agents that achieves sample efficiency, privacy preservation, heterogeneous-objective handling, and communication efficiency. Each agent maintains a private Gaussian Process (GP) surrogate trained on local data and communicates only through knowledge tokens-compact, lossy summaries containing directional signals, advantage scores, and optional language-model (LM) insights-without sharing raw data or model parameters. ADKO unifies GP-Upper Confidence Bound (GP-UCB), parallel Bayesian optimization, decentralized learning, and LM-guided discovery. We provide the first formal analysis of dual information loss: token compression, quantified via mutual-information-based fidelity, and LM approximation error, decomposed into bias and stochastic noise. Our main result shows cumulative regret decomposes into GP error, LM bias, LM noise, and compression loss, with necessary and sufficient conditions for sublinear regret. We also propose fidelity-aware token pruning to preserve high-information tokens under memory budget. Experiments on neural architecture search and scientific discovery validate the theory and show consistent improvements over strong baselines.
comment: 31 pages
☆ On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems
Federated Learning (FL) has emerged as a promising paradigm for preserving client data ownership and control over distributed Internet of Things (IoT) environments. While discriminative models dominate most FL use cases, recent advances in generative models -- such as Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models (DM) -- offer new opportunities for unsupervised anomaly detection in time series analysis, with relevant applications in predictive maintenance (PdM) in critical industrial infrastructures. In this work, we present a comprehensive analysis of VAEs, GANs, and DMs in the context of federated PdM. We analyze their performance and communication overhead under both full and partial federation setups, where only subsets of model components are shared. Building on this analysis, the paper proposes a novel taxonomy for federated generative models that formalizes partial component sharing as a principled mechanism for model personalization. Our experiments over a real-world time series dataset reveal distinct trade-offs in model utility, stability, and scalability, especially in heterogeneous and bandwidth-constrained FL settings. For the evaluated GAN-based configurations, full federation improves training stability relative to independent local training, although the model remains less robust than the VAE- and DDPM-based alternatives. For DMs, however, partial federation -- especially decoder sharing -- can outperform full federation in bandwidth-constrained, non-IID settings.
☆ Actor-Critic Algorithm for Dynamic Expectile and CVaR
Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, a model-free off-policy actor-critic algorithm is constructed. Empirical results in domains with verifiable risk-averse behavior show that our algorithm can learn risk-averse policy and consistently outperforms other existing methods.
☆ MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning
With the rise in scale for deep learning models to billions of parameters, the computational cost of fine-tuning remains a significant barrier to deployment. While Low-Rank Adaptation (LoRA) has become the standard for parameter-efficient fine-tuning, the need to set a predefined, static rank $r$ requires exhaustive grid searches to balance efficiency and performance. Existing rank-adaptive solutions such as DyLoRA mitigate this by sampling ranks during the training from a predefined distribution. However, they often yield sub-optimal results at higher ranks due to lack of consistent gradient signals across the full hierarchy of ranks, thus making these methods data-inefficient. In this paper, we propose MatryoshkaLoRA, a general, Matryoshka-inspired training framework for LoRA that learns accurate hierarchical low-rank representations by inserting a fixed, carefully crafted diagonal matrix $P$ between the existing LoRA adapters to scale their sub-ranks accordingly. By introducing this simple modification, our general framework recovers LoRA and DyLoRA only by changing $P$ and ensures all sub-ranks embed the available gradient information efficiently. Our MatryoshkaLoRA supports dynamic rank selection with minimal degradation in accuracy. We further propose Area Under the Rank Accuracy Curve (AURAC), a metric that consistently evaluates the performance of hierarchical low-rank adapters. Our results demonstrate that MatryoshkaLoRA learns more accurate hierarchical low-rank representations than prior rank-adaptive approaches and achieves superior accuracy-performance trade-offs across ranks on the evaluated datasets. Our code is available at https://github.com/IST-DASLab/MatryoshkaLoRA.
☆ Distributional simplicity bias and effective convexity in Energy Based Models
Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.
comment: 13 pages, 2 figures
☆ \mathsf{VISTA}: Decentralized Machine Learning in Adversary Dominated Environments
Decentralized machine learning often relies on outsourcing computations, such as gradient evaluations, to untrusted worker nodes. Existing robust aggregation methods can mitigate malicious behavior under honest-majority assumptions, but may fail when adversaries control a majority of the workers. We study this adversary-dominated setting through an incentive-oriented framework in which reports are accepted and rewarded only when they are mutually consistent up to a threshold. This turns the adversary from a pure saboteur into a rational agent that trades off increasing estimation error against the risk of rejection and loss of reward. We consider iterative optimization under this model. Unlike one-shot computation, iterative learning requires long-horizon decisions: permissive acceptance rules enable faster early progress but admit more adversarial corruption, while strict rules improve estimation accuracy but cause frequent rejections. We propose \mathsf{VISTA}, an adaptive algorithm that tunes the acceptance threshold using the optimization history. Numerical results show that \mathsf{VISTA} improves convergence over static thresholds. We also provide a rigorous convergence analysis showing that, with suitable incentive-aware adaptation, adversary-dominated decentralized learning can retain the asymptotic convergence behavior of standard SGD without relying on an honest majority.
☆ RelAgent: LLM Agents as Data Scientists for Relational Learning
Relational learning is a challenging problem that has motivated a wide range of approaches, including graph-based models (e.g., graph neural networks, graph transformers), tabular methods (e.g., tabular foundation models), and sequence-based approaches (e.g., large language models), each with its own advantages and limitations. We propose RelAgent, an LLM-based autonomous data scientist for relational learning, which operates in two phases. In the search phase, an LLM agent uses database, validation, and evaluation workspace tools to construct SQL feature programs and select a predictive model. In the inference phase, the resulting program is executed without further LLM calls. The final predictor consists of SQL queries and a classical model, enabling fast, deterministic, and intrinsically interpretable predictions: features are human-readable queries, and predictions depend only on the resulting query-defined feature map, enabling scalable deployment using standard database systems.
☆ PPI-Net connects molecular protein interactions to functional processes in disease
Understanding how molecular alterations propagate across biological systems to drive disease remains a central challenge. Although high-throughput profiling enables comprehensive characterization of tumor states, most models neglect structured biological relationships or lack interpretability across scales. Here we present PPI-Net, a hierarchical graph neural network that integrates protein-protein interaction (PPI) networks with pathway-level representations to model disease from molecular interactions to functional processes. Patient-specific molecular profiles are embedded within a shared interaction network from STRING and propagated through a multi-layer Reactome hierarchy using graph attention, enabling aggregation of gene-level signals into higher-order biological programs. Across RNA-seq data from ten cancer types from The Cancer Genome Atlas, PPI-Net achieves robust predictive performance, with balanced accuracy exceeding 90% in multiple cohorts. Comparative analysis on RNA-Seq data from breast cancer demonstrated that PPI-Net's integration of the Reactome hierarchy improved balanced accuracy by 6.7% relative to a PPI-only model, while hierarchical multi-level supervision improved balanced accuracy by 12.3% relative to using only a single top-level prediction head. Applying a multi-omics approach using RNA-seq and methylation data improves model interpretation, recovering canonical oncogenic modules, including TP53-AKT signaling and stress response pathways, while revealing convergence onto coherent programs such as ion signaling and cellular responses to stimuli. These results demonstrate that integrating interaction networks with pathway hierarchies enables accurate prediction while providing mechanistic insight into cancer biology.
comment: 17 pages, 3 figures, 2 tables
☆ Approximation-Free Differentiable Oblique Decision Trees
Decision Trees (DTs) are widely used in safety-critical domains such as medical diagnosis, valued for their interpretability and effectiveness on tabular data. However, training accurate oblique DTs is challenging due to complex optimization landscapes and overfitting risks, particularly in regression. Recent advances have introduced differentiable formulations that enable gradient-based training and joint optimization of decision boundaries and leaf regressors. Yet, existing approaches typically rely on approximations, either through probabilistic softening of boundaries (soft DTs) or quantized gradients such as the Straight-Through Estimator (STE). To overcome these limitations, we propose DTSemNet, a novel, semantically equivalent, and invertible representation of hard oblique DTs as neural networks. DTSemNet enables end-to-end training with standard gradient descent, eliminating the need for approximations in both classification and regression. While classification aligns naturally with this formulation, regression remains challenging due to the joint optimization of internal nodes and leaf regressors. To address this, we analyze the limitations of STE and introduce an annealed Top-k method that provides accurate gradient signals without approximation. Extensive experiments on classification and regression benchmarks show that DTSemNet-trained oblique DTs outperform state-of-the-art differentiable DTs. Furthermore, we demonstrate that DTSemNet can serve as programmatic DT policies in reinforcement learning environments, thereby broadening their applicability.
comment: Accepted for publication in JMLR, Vol. 27, 2026
☆ NSPOD: acceleratingthe convergence ofKrylov-based iterative linearsolvers via approximated PODs
The convergence of Krylov-based linear iterative solvers applied to parametric partial differential equations (PDEs) is often highly sensitive to the domain, its discretization, the location/values of the applied Dirichlet/Neumann boundary conditions, body forces and material properties, among others. We have previously introduced hybridization of classical linear iterative solvers with neural operators for specific geometries, but they tend to not perform well on geometries not previously seen during training. We partially addressed this challenge by introducing the deep operator network Geo-DeepONet and hybridizing it with Krylov-based iterative linear solvers, which, despite learning effectively across arbitrary unstructured meshes without requiring retraining, led to only modest reductions in iterations compared to state-of-the-art preconditioners. In this study we introduce Neural Subspace Proper Orthogonal Decomposition (NSPOD), a multigrid-like deep operator network-based preconditioner which can dramatically reduce the number of iterations needed for convergence in Krylov-based linear iterative solvers, even when compared to state-of-the-art methods such as algebraic multigrid preconditioners. We demonstrate its efficiency via numerical experiments on a linearized version of solid mechanics PDEs applied to unstructured domains obtained from complex CAD geometries. We expect that the findings in this study lead to more efficient hybrid preconditioners that can match, or possibly even surpass, the convergence properties of the current gold standard preconditioning methods for solid mechanics PDEs.
comment: 17 pages, 9 figures, 3 tables
☆ Scaling Categorical Flow Maps
Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
☆ OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
☆ GRASP -- Graph-Based Anomaly Detection Through Self-Supervised Classification
Advanced persistent threat (APT) attacks remain difficult to detect due to their stealth, adaptability, and use of legitimate system components. Provenance-based intrusion detection systems (PIDS) offer a promising defense by capturing detailed relationships between system components and actions. However, current PIDS rely on predefined or subset-determined thresholds, which limit detection stability and the ability to detect any anomalous behavior in general. Furthermore, related work often neglects the role of process executables, which describe system activity by interacting through a process with files, network components, and other processes. We introduce GRASP, a PIDS based on masked self-supervised classification. GRASP masks the executable information of processes and learns to infer it from their two-hop provenance graph neighborhood, marking misclassified processes as anomalies. It captures behavior patterns for the learned executables without thresholding, making it robust against interference and unknown activities. Evaluations on the DARPA TC and OpTC datasets demonstrate that GRASP consistently detects anomalous behavior, including known attack-related activities, outperforming existing systems. Our PIDS identifies all documented attacks on datasets where the behavior of executables is learnable. In addition, compared to existing systems, GRASP uncovers potentially malicious anomalous behavior not labeled as an attack in the documentation.
comment: 17 pages
☆ The Minimax Rate of Second-Order Calibration
We characterize the minimax rate of estimating the second-order calibration error for binary classification, which quantifies whether a higher-order predictor's epistemic-uncertainty estimate matches the conditional variance of the label probability on its level sets. Our key observation is that the sech perturbation kernel, previously used only to enforce smoothness of calibration functions, in fact makes them analytic in a strip of half-width $hπ/2$. Polynomial regression then estimates the calibration error at rate $\tilde{O}(1/\sqrt{n})$, with explicit constants, a qualitative improvement over the $O(n^{-1/4})$ rate achievable by bucketing or kernel smoothing. A matching $Ω(1/\sqrt{n})$ lower bound establishes minimax optimality up to logarithmic factors. As a corollary, we give the first finite-sample guarantee for second-order Platt scaling, yielding a post-hoc procedure that recalibrates both the mean prediction and the epistemic-variance estimate of any higher-order predictor. Along the way, we provide a bucket-free definition of second-order calibration and relate it quantitatively to the bucketed formulation of Ahdritz et al. [2025]. Our experiments confirm the predicted rate and the quality of the recalibrated uncertainties.
☆ Text-to-CAD Evaluation with CADTests
Text-to-CAD has recently emerged as an important task with the potential to substantially accelerate design workflows. Despite its significance, there has been surprisingly little work on Text-to-CAD evaluation, and assessing CAD model generation performance remains a considerable challenge. In this work, we introduce a new evaluation perspective for Text-to-CAD based on automated testing. We propose CADTestBench, the first test-based benchmark for Text-to-CAD, based on CADTests, executable software tests that verify whether a generated CAD model satisfies the geometric and topological requirements of the input prompt. Using CADTestBench, we conduct comprehensive benchmarking of recent Text-to-CAD methods and further demonstrate that CADTests can also guide CAD model generation, yielding simple baselines that surpass performance of current methods. CADTestBench code and data are available at GitHub and Hugging Face dataset.
☆ Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs
Large Language Models (LLMs) are increasingly used in settings where reliable self-assessment is critical. Assessing model reliability has evolved from using probabilistic correctness estimates to, more recently, eliciting verbalized confidence. Confidence, however, has been shown to be an inconsistent and overoptimistic predictor of model correctness. Drawing on cognitive appraisal theory, a framework from human psychology that decomposes self-evaluation into multiple components, we propose a multidimensional perspective on model self-assessment. We elicit six appraisal-based dimensions of self-assessment, alongside confidence, and evaluate their utility for predicting model failure across 12 LLMs and 38 tasks spanning eight domains. We find that competence-related appraisal dimensions, particularly effort and ability, consistently match or outperform confidence across most settings. Effort additionally yields less overoptimistic estimates that remain stable across model sizes. In contrast, affective dimensions provide marginally predictive signals. Furthermore, the most informative dimension varies systematically with task characteristics: effort is most predictive for reasoning-intensive tasks, while ability and confidence dominate on retrieval-oriented tasks. Broadly, our findings indicate that structured multidimensional self-assessment is a promising approach to improving the reliability and safety of language model deployment across diverse real-world settings.
☆ Flexible Routing via Uncertainty Decomposition
A key strategy for balancing performance and cost in modern machine learning systems is to dynamically route queries to either a low-cost model or a more expensive oracle (such as a large pretrained model or human expert), an approach known as model routing. In this work we present a new uncertainty-aware router that (1) avoids unnecessary oracle calls on inherently ambiguous queries, and (2) adapts dynamically to different loss functions and cost parameters through simple hyperparameter changes, without retraining. Our method, applicable to any classification setting where multiple independent annotations per input are available, is based on decomposing total uncertainty into irreducible and reducible components using higher-order predictors [Ahdritz et al., 2025]. This enables a unified approach to both routing and abstention: predict with the weak model when uncertainty is low, route to the oracle when reducible uncertainty is high, and abstain when irreducible uncertainty is high. Our router comes with strong theoretical guarantees bounding regret relative to optimal task-specific routers. We conduct experiments on both synthetic and real-world datasets that demonstrate the benefits of our approach in suitable regimes -- in particular, whenever reducible and irreducible uncertainty are not too correlated.
☆ Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these ``drifted'' trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce \textbf{Prune-OPD}, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-$k$ overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6\%--68.0\% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
comment: 17 pages, 8 figures
☆ Toward Privileged Foundation Models:LUPI for Accelerated and Improved Learning
Training foundation models is computationally intensive and often slow to converge.We introduce PIQL,Privileged Information for Quick and Quality Learning, the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect, reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
☆ Neural Operators as Efficient Function Interpolators
Neural operators (NOs) are designed to learn maps between infinite-dimensional function spaces. We propose a novel reframing of their use. By introducing an auxiliary base-space, any finite-dimensional function can be viewed as an operator acting by composition on functions of the base-space. Through a range of benchmarks on analytic functions of increasing complexity and dimensionality, we demonstrate that NOs can match or outperform standard multilayer perceptrons and Kolmogorov--Arnold Networks in accuracy while requiring significantly fewer parameters and training time. As a real-world application, we apply a two-dimensional Tensorized Fourier Neural Operator (TFNO) to the nuclear chart, learning a correction to state-of-the-art nuclear mass models as a partially observed residual field. A TFNO ensemble reaches a held-out root-mean-square error of 198.2 keV, placing it among the best recent neural-network approaches while retaining high parameter efficiency and short training times. More broadly, these results introduce NOs as a scalable framework for finite-dimensional function interpolation, from analytic benchmarks to structured scientific data.
comment: 12 pages, 9 figures
☆ Spectral Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation
The Hessian spectrum of trained deep networks exhibits a characteristic structure: a continuous bulk of near-zero eigenvalues and a small number of large outlier eigenvalues (spikes), confirming the relevance of Random Matrix Theory in deep learning. The spike count matches the number of classes minus one. While prior work has described this structure, no method has exploited it operationally to improve classification performance. We propose Spectral Surgery, a post-hoc optimization method that directly perturbs model weights along spike eigenvectors to rebalance per-class accuracy without retraining. We introduce (i) a spike-class sensitivity matrix that quantifies the directional derivative of each class's accuracy along each spike eigenvector, (ii) a constrained optimization of perturbation coefficients that targets weak classes while preserving strong ones, and (iii) an adaptive amplitude control that raises or lowers the perturbation budget based on iteration-level improvement signals. We obtain encouraging results on CIFAR-10 and ISIC-2019 on both balanced accuracy and standard deviation.
☆ Tracing Uncertainty in Language Model "Reasoning"
Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treating the "reasoning" traces, the intermediate token sequences generated by LMs, as evolving model states. We summarize each trace by an uncertainty trace profile: a small set of features describing the shape of the uncertainty signal over its trace, such as its slope and linearity. We find that across five LMs evaluated on GSM8K and ProntoQA, these profiles predict whether a trace yields a correct final answer with AUROC up to 0.807, improving markedly on recent related work. We reach AUROC 0.801 using only the first few hundred tokens of full traces, suggesting that errors can be detected early in the generation. A detailed comparison of correct and incorrect traces further reveals qualitatively distinct uncertainty profiles, with correct traces showing a steeper and less linear decline in uncertainty. Together, the results suggest that our method, grounded in decision-making under uncertainty, provides a principled lens for studying the generative process underlying LM "reasoning".
☆ POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T γ_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
comment: preprint
☆ Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers
Transformers perform inference by iteratively transforming token representations across layers. This layerwise computation has been studied empirically, and recent mean-field theories of Transformer dynamics explain how attention can drive token distributions toward clustering. However, existing mean-field analyses largely treat model parameters as prescribed, leaving open how training reshapes this clustering picture. We study this question in a noisy mean-field Transformer in which only a parameter-linear FFN is trained under $L^2$ regularization. We find and analyze a training-induced phase in the dynamics: after initially following attention-driven clustering, the token distribution can leave the clustered regime near the final layers. Our mathematical analysis is based on an entropy-regularized interaction energy that captures the clustering bias of attention. More broadly, our results point toward a training-aware mean-field theory of Transformer dynamics, in which training and inference dynamics are treated together.
comment: 48 pages, 6 figures, comments are wellcome!
☆ Interactive Trajectory Planning with Learning-based Distributionally Robust Model Predictive Control and Markov Systems
We investigate interactive trajectory planning subject to uncertainty in the decisions of surrounding agents. To control the ego-agent, we aim to first learn the decision distribution and solve a Stochastic Model Predictive Control (SMPC) problem. To account for errors in the learned distribution, we show that it is possible to utilize Probably Approximately Correct (PAC) learning in combination with Distributionally Robust (DR) optimization to obtain a solution which accounts for the errors induced by the learning model. The results indicate that our PAC learning-based DR-MPC framework provides a method to interpolate between a robust MPC and an omnipotent SMPC, based on the available number of samples.
Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation
In this work, we study TabPFN as a training-free, modular summary network for simulation-based Bayesian inference (SBI). Tabular foundation models such as TabPFN are pretrained on broad families of synthetic tabular data-generating processes and adapt at test time through in-context learning, making them natural candidates for SBI, where posterior estimation often depends on learning informative summaries of simulated observations. We propose PFN-NPE: a general recipe that uses a pretrained TabPFN encoder as a fixed summary network for simulator outputs, then pairs the resulting summaries with a downstream inference head chosen for the problem. With normalizing flows as the default inference head, PFN-NPE matches established posterior approximation methods and sometimes outperforms them. More importantly, diagnostic probes show that the TabPFN-derived summaries often preserve useful posterior location and marginal information. These analyses also reveal a limitation in that TabPFN-derived summaries may struggle to represent the joint posterior structure even when the marginals are well recovered. Still, our experiments show that TabPFN can serve as an effective summary network across a diverse set of SBI settings, with the inference network left modular and task-dependent.
☆ SMT-Based Active Learning of Weighted Automata
We present an SMT-based active learning algorithm for nondeterministic weighted automata (WFAs) as a practical and robust alternative to Hankel/L*-style methods. Our algorithm is parametric in a given semiring and, if it terminates, guaranteed to produce minimal WFAs. We prove partial correctness and provide a sufficient termination condition, which in particular implies termination for all finite semirings. Our extensive experimental evaluation shows that our algorithm is capable of learning numerous minimal WFAs over both finite and infinite semirings, vastly outperforms a naive baseline, and is competitive with a state-of-the-art algorithm while producing significantly smaller automata and requiring less interaction with the teacher.
comment: Appearing in CAV 2026
☆ Efficient Verification of Neural Control Barrier Functions with Smooth Nonlinear Activations
Formal verification of neural control barrier functions (NCBFs) remains challenging, especially for neural networks with nonlinear activations like \(\tanh\). Existing CROWN-based methods rely on conservative linear relaxations for Jacobian bounds, limiting scalability. We propose LightCROWN, which computes tighter Jacobian bounds by exploiting the analytical properties of activation functions. Experiments on nonlinear control systems including the inverted pendulum, Dubins car, and planar quadrotor demonstrate that LightCROWN improves verification success rates up to 100\%, while enhancing speed and scalability. Our approach provides a generalizable improvement for CROWN-based frameworks, enabling more efficient verification of complex NCBFs. The code can be found at github.com/Autonomous-Systems-and-Control-Lab/verify-neural-CBF.
comment: 9 pages, 4 figures
☆ When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining
Modern deep models are often pretrained on large-scale data with missing labels using composite objectives, where the relative weights of multiple loss terms act as hyperparameters. Tuning these weights with random search or Bayesian optimization is computationally expensive, as it requires many independent training runs. To address this, we propose a gradient-based bilevel method that learns pretraining loss weights online by aligning the composite pretraining gradient with a downstream objective. By exploiting the structure of the loss, the method avoids the multiple backward passes typically required by truncated backpropagation through the full model, reducing the overhead of hyperparameter tuning to approximately 30% above a single training run. We evaluate the approach on event-sequence modeling and self-supervised computer vision, where it matches or improves upon carefully tuned baselines while substantially reducing the cost of hyperparameter tuning compared to random or Bayesian search.
☆ Rethinking State Tracking in Recurrent Models Through Error Control Dynamics
The theory of state tracking in recurrent architectures has predominantly focused on expressive capacity: whether a fixed architecture can theoretically realize a set of symbolic transition rules. We argue that equally important is error control, the dynamics governing hidden-state drift along the directions that distinguish symbolic states. We prove that affine recurrent networks, a class of models encompassing State-Space Models and Linear Attention, cannot correct errors along state-separating subspaces once they preserve state representations. Consequently, practical affine trackers do not learn robust state tracking; rather, they learn finite horizon solutions governed by accumulated state-relevant error. We characterize the mechanics of this failure, showing that tracking remains readable only while the accumulating within-class spread remains small relative to the initial between-class separation. We demonstrate empirically on group state-tracking tasks that this breakdown is predictable: tracking collapses when the distinguishability ratio crosses the readability threshold of the trained decoder. Across trained models, the point of this crossing predicts the horizon at which downstream accuracy fails. These results establish that robust state tracking is determined not only by an architecture's theoretical expressivity but crucially by its error control.
☆ Robust and Reliable AI for Predictive Quality in Semiconductor Materials Manufacturing with MLOps and Uncertainty Quantification
Semiconductor materials manufacturing presents unique challenges for machine learning deployment due to evolving process conditions, equipment degradation, and raw material variability that can cause model performance deterioration over time. This study benchmarks machine learning operations (MLOps) retraining strategies using five years of real manufacturing data to identify optimal retraining approaches for quality prediction. We evaluate various retraining frequencies and hyperparameter optimization strategies using control limit normalized residuals as key performance metric. Results demonstrate that a fixed retraining cadence every five production batches without hyperparameter retuning achieves superior performance across all drift conditions while significantly reducing computational overhead compared to strategies incorporating hyperparameter optimization. This approach effectively maintains model accuracy during both abrupt process changes and gradual equipment degradation patterns. To address the critical need for uncertainty quantification in manufacturing decision-making, we implement conformal prediction to generate prediction confidence intervals with strong statistical guarantees. This enables proactive quality control by identifying when prediction intervals fall within acceptable control limits, transforming traditional reactive quality management into a predictive framework. The findings provide practical guidelines for implementing robust MLOps strategies in manufacturing environments where computational efficiency and reliable uncertainty quantification are paramount for operational success.
☆ Flow Matching for Count Data
High-dimensional count data arise in applications such as single-cell RNA sequencing and neural spike trains, where mapping between distributions across successive batches or time points form critical components of data analysis. The recent success of diffusion- and flow-based deep generative models for images, video, and text motivates extending these ideas to count-valued settings, but many existing methods either treat each count as a categorical state or transform counts into a continuous space, neither of which is natural or efficient when the count range is large. We propose count-FM, a flow-matching framework for count data based on a continuous-time birth-death process with local unit jumps. Count-FM learns marginal transitions efficiently in count space through simulation-free training of conditional transition rates, allowing transport between arbitrary count-distributed source and target populations. In simulation, count-FM achieves better sample quality than representative baselines while using substantially fewer parameters. We further apply count-FM to scRNA-seq and neural spike-train data for unconditional generation, transport, and conditional generation. Across these tasks, count-FM yields improved sample quality, greater modeling efficiency, and interpretable transport paths.
☆ Physics-Informed Reduced-Order Operator Learning for Hyperelasticity in Continuum Micromechanics
Physics-informed operator learning is an attractive candidate for surrogate modeling of microstructures, especially in multiscale finite-element simulations. Its practical use, however, is often limited by the high cost of loss evaluation. We address this bottleneck by combining the Equilibrium Neural Operator (EquiNO) with the QR-based discrete empirical interpolation method (Q-DEIM). EquiNO learns only the modal coefficients of reduced displacement-fluctuation and first Piola-Kirchhoff stress representations built from periodic and divergence-free bases, thereby enforcing periodicity and mechanical equilibrium by construction. Q-DEIM then identifies a small set of spatial points through a column-pivoted QR factorization of the stress basis and restricts constitutive evaluations during training to these points alone. This makes full-batch second-order optimization practical for three-dimensional representative volume elements (RVEs). Homogenized first Piola-Kirchhoff stresses are recovered directly from the offline-averaged reduced stress modes, without the need to reconstruct the full stress field at inference time. We validate the framework on two three-dimensional finite-strain hyperelastic RVEs. Q-DEIM reduces the per-step training cost by roughly three orders of magnitude relative to full-field loss evaluation, while reduced homogenization achieves speed-up factors of order $10^3$ to $10^4$ over direct full-field computations. Despite relying on only a small number of offline snapshot loading paths for basis construction, the method accurately interpolates and extrapolates both microscopic stress fields and homogenized stresses, with prediction quality improving systematically as more snapshots are added.
comment: 22 pages, 12 figures
☆ Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach SC
Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers prevent traditional matching approaches, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combined with temporal information, then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves 26 percentage point precision improvement in North America and 14 points in Europe, while doubling coverage. Deployed in production at Project44 handling full truckload shipments, the system demonstrates robustness to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
comment: 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)
☆ Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probability space. By construction, this gradient is decomposed into an ascent toward higher action-value regions and a score matching with the anchor policy as a trust region. We further derive a simple, tractable surrogate of the otherwise intractable update loss, akin to behavior cloning on top-K critic-selected actions. We find empirically that this mechanism uniquely benefits the drifting backbone owing to its non-ODE parameterization. With one-step inference, DFP achieves state-of-the-art performance on several manipulation tasks across Robomimic and OGBench, outperforming ODE-based policies.
☆ Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that over-optimize that objective. Prior work suggests that such collapse is unavoidable without adding real data into the mix. We revisit this conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.
comment: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
☆ Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
comment: 22 pages, 5 figures, 11 tables
☆ An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well -- the worst average degradation is only -0.26 relative to FULL, while delivering 1.5$\times$-3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
☆ Bayesian Fine-tuning in Projected Subspaces
Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large models by decomposing weight updates into low-rank matrices, significantly reducing storage and computational overhead. While effective, standard LoRA lacks mechanisms for uncertainty quantification, leading to overconfident and poorly calibrated models. Bayesian variants of LoRA address this limitation, but at the cost of a significantly increased number of trainable parameters, partially offsetting the original efficiency gains. Additionally, these models are harder to train and may suffer from unstable convergence. In this work, we propose a novel framework for parameter-efficient Bayesian fine-tuning, demonstrating that effective uncertainty quantification can be achieved in very low-dimensional parameter spaces. The proposed method achieves strong performance with improved calibration and generalization while maintaining computational efficiency. Our empirical findings show that, with the appropriate projection of the weight space uncertainty can be effectively modeled in a low-dimensional space, and weight covariances exhibit low ranks.
☆ Future Validity is the Missing Statistic: From Impossibility to $Φ$-Estimation for Grammar-Faithful Speculative Decoding
Grammar-constrained generation is often combined with local vocabulary masking and speculative decoding, but the resulting sampling law is not the grammar-conditional distribution users usually intend. We show that any speculative decoder with local mask access, Leviathan rejection, and rollback soundness samples from the locally projected distribution $μ^{\mathrm{proj}}$ rather than the grammar-conditional distribution $μ^\star$. This extends the GAD impossibility result to speculative decoding; on Dyck grammars with Qwen3-8B, the total-variation gap can reach 0.996. We identify the future-validity function $Φ_t(y)=\Pr_p[\mathrm{valid\ completion}\mid y]$ as the missing correction statistic. The target distribution is a Doob transform of the base model with $h=Φ$, while local masking corresponds to setting $h$ to one. With exact $Φ$, our oracle decoder FVO-Spec samples exactly from $μ^\star$; with approximate $Φ$, we bound the resulting total-variation error. Because exact future validity is hard for general context-free grammars, we evaluate estimator hierarchies on tractable Dyck and finite JSON languages. OneStep reduces Dyck TV by 14% with under 1% throughput overhead, exact dynamic programming reduces it by 97%, and finite-language correction closes JSON gaps to numerical precision. All fidelity claims are scoped to enumerable grammars and token tries.
☆ Toward Better Geometric Representations for Molecule Generative Models
Geometric representation-conditioned molecule generation provides an effective paradigm that decouples molecule representation modeling from structure generation. By decoupling molecule generation into two stages-first generating a meaningful molecule representation, and then generating a 3D molecule conditioned on this representation-the efficiency and quality of the generation process can be significantly enhanced. However, its effectiveness is fundamentally limited by the quality of the representation space: pretrained molecular encoders, such as UniMol, produce representations that are non-smooth and not fully exploited during the generative training process. In this work, we propose LENSEs, a framework that better exploits the potential of molecule representations in representation-conditioned generation methods. In particular, LENSEs introduces three complementary mechanisms: (1) a representation head, simultaneously trained during generative tasks, that extracts multi-level representations from the pretrained encoder; (2) a molecule perceptual loss that optimizes the generator in a semantic-informative representation space; and (3) a node-level representation alignment (REPA) loss that explicitly aligns the generator's hidden states with encoder representations, reducing the semantic gap between pretraining and generation. We demonstrate the effectiveness of these improvements through extensive molecule generation tasks. Specifically, on the challenging molecule generation dataset GEOM-DRUG, LENSEs achieves 97.28% validity and 98.51% molecule stability, surpassing existing advanced methods. Further analyses through Lipschitz constant reduction (4.6x) and QM9 probing tasks also demonstrate the smoother, more informative refined representations, establishing generative training with alignment objectives as a potential pretraining paradigm for molecular encoders.
☆ Fortifying Time Series: DTW-Certified Robust Anomaly Detection
Time-series anomaly detection is critical for ensuring safety in high-stakes applications, where robustness is a fundamental requirement rather than a mere performance metric. Addressing the vulnerability of these systems to adversarial manipulation is therefore essential. Existing defenses are largely heuristic or provide certified robustness only under $\ell_p$-norm constraints, which are incompatible with time-series data. In particular, $\ell_p$-norm fails to capture the intrinsic temporal structure in time series, causing small temporal distortions to significantly alter the $\ell_p$-norm measures. Instead, the similarity metric \emph{Dynamic Time Warping} (DTW) is more suitable and widely adopted in the time-series domain, as DTW accounts for temporal alignment and remains robust to temporal variations. To date, however, there has been no certifiable robustness result in this metric that provides guarantees. In this work, we introduce the first \emph{DTW-certified robust defense} in time-series anomaly detection by adapting the randomized smoothing paradigm. We develop this certificate by bridging the $\ell_p$-norm to DTW distance through a lower-bound transformation. Extensive experiments across various datasets and models validate the effectiveness and practicality of our theoretical approach. Results demonstrate significantly improved performance, e.g., up to 18.7\% in F1-score under DTW-based adversarial attacks compared to traditional certified models.
☆ Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works
Group Relative Policy Optimization (GRPO) is a standard algorithm for reinforcement learning from verifiable rewards, but its group-mean-centered advantage can fail under binary rewards. The failure mode is gradient starvation: when every response in a group is correct or every response is wrong, the centered advantage is exactly zero and the policy receives no learning signal. We prove that the true degeneracy rate always exceeds the i.i.d. Bernoulli prediction by Jensen's inequality, and observe a 0.69 degeneracy rate at group size four in logged Qwen3.5-9B GSM8K training. We then show that the fixed-reference Sign advantage, $A=2r-1$, performs pass@$G$ failure descent by increasing the probability that at least one sample in the group succeeds. On the full GSM8K test set across seven seeds, Sign reaches 73.8% accuracy versus 28.4% for standard normalized group-mean DrGRPO at group size four, a 45.4 point gain with $p<0.0001$. The effect is directionally consistent on Llama-3.1-8B and positive but underpowered on a MATH-500 transfer check. Pass@$k$ analysis indicates that the main benefit is search compression rather than large capacity expansion, aligning the empirical gains with recent RLVR ceiling observations.
☆ The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
Chain-of-thought reasoning is often treated as a monotone way to improve language-model accuracy by letting a model think longer. We identify a countervailing effect, the coupling tax: when reasoning traces and final answers share one output-token budget, long traces can crowd out the answer they are meant to support. Across GSM8K, MATH-500, and five BIG-Bench Hard tasks with Qwen3 models at three scales, non-thinking mode matches or outperforms thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while harder tasks shift the crossover to larger budgets. We derive a truncation-waste decomposition, $\mathrm{Acc}_{\mathrm{think}}(b)=α_c F_L(b)+α_t(1-F_L(b))$, that predicts this crossover from chain-length and accuracy statistics and explains inverse scaling within the Qwen family. A DeepSeek-R1-Distill-Llama-8B replication shows the same pattern under a different thinking interface. As a mitigation, split-budget generation decouples reasoning and answer budgets; on full MATH-500, IRIS reaches 74.0% accuracy, a strengthened extraction variant reaches 78.8%, and a fixed non-oracle SC+IRIS gate reaches 83.6%. The results show that test-time reasoning should be evaluated as a budget-allocation problem, not only as a question of whether longer traces are available.
comment: 40 pages, 6 figures
☆ Structured Coupling for Flow Matching
Standard flow matching scales well but typically relies on an unstructured source distribution, limiting its ability to learn interpretable latent structure. Latent-variable models, by contrast, capture structure but often sacrifice generative quality. We bridge this gap by proposing Structured Coupling for Flow Matching (SCFM), a cooperative framework that augments flow matching with structured latent representation learning. By introducing structured latent variables and exogenous noise into the source, SCFM jointly learns a structured prior (via latent variable modeling) and a continuous transport map (via flow matching). It uses a shared time-dependent recognition network for both latent variable model variational inference and intermediate-time flow velocity estimation. This yields a structurally informed yet unconditional, simulation-free flow model, where the latent variable model can also assist flow sampling. Empirically, SCFM facilitates unsupervised latent representation learning for clustering, disentanglement and downstream tasks, while remaining competitive with flow matching in sample quality, showing that meaningful structure can be learned without sacrificing generative fidelity.
☆ FactoryBench: Evaluating Industrial Machine Understanding
We introduce FactoryBench, a benchmark for evaluating time-series models and LLMs on machine understanding over industrial robotic telemetry. Q&A pairs are organized along four causal levels (state, intervention, counterfactual, decision) instantiating Pearl's ladder of causation, and span five answer formats: four structured formats are scored deterministically and free-form answers are scored by an LLM-as-judge voting protocol. We propose a scalable Q&A generation framework built around structured question templates, present FactoryWave (a dense, multitask, multivariate sensor dataset collected from a UR3 cobot and a KUKA KR10 industrial arm), and construct FactoryBench as a large-scale benchmark of over 70k Q&A items grounded in roughly 15k normalized episodes from FactoryWave, AURSAD, and voraus-AD. Zero-shot evaluation of six frontier LLMs shows that no model exceeds 50% on structured levels or 18% on decision-making, revealing a wide gap between current models and operational machine understanding.
comment: 9 pages, 4 figures, 14 tables; appendix with 24 pages
☆ Differentially Private Auditing Under Strategic Response
Regulatory audits of AI systems increasingly rely on differential privacy (DP) to protect training data and model internals. We study audit design when the audited developer can strategically respond to the privacy-constrained audit interface. We formalize privacy-constrained auditing as a bilevel Stackelberg game, in which an auditor commits to a query policy and DP budget allocation across harm dimensions, and a strategic developer reallocates mitigation efforts in response. We introduce the welfare-weighted under-detection gap $B_w$, the welfare-weighted true residual harm the audit fails to detect at the developer's strategic best response, and prove that naive DP auditing (uniform or harm-proportional allocation) induces a strictly larger $B_w$ than any non-strategic mitigation baseline whenever effective detectability is heterogeneous, the welfare weights are not comonotone with detectability, and the developer's optimum is interior. We characterize the optimal auditor allocation as a four-factor balance of welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature, and provide a single-level reformulation of the bilevel problem via the developer's KKT system. We propose Strategic Private Audit Design (SPAD), a projected-gradient algorithm with hypergradients computed through the developer's best response.
☆ Debiased Counterfactual Generation via Flow Matching from Observations
Estimating counterfactual distributions under interventions is central to treatment risk assessment and counterfactual generation tasks. Existing approaches model the counterfactual distribution as a standalone generative target, without exploiting its relationship to the observational data. In this work, we show that under standard assumptions, observational and counterfactual outcome distributions are tightly linked: they have identical support and tail behavior, remain statistically close under weak confounding, and share any features of high-dimensional outcomes which are invariant to confounders. These properties motivate learning counterfactual distributions not from scratch, but via a deconfounding flow from the observational distribution. We formulate this problem via flow-matching and derive a semiparametrically efficient estimator based on a novel efficient influence function correction. We subsequently extend our estimator to target minimal-energy flows in high-dimensions, which we show can be especially simple targets between observational and counterfactual distributions. In experiments, deconfounding flows outperform existing debiased counterfactual distribution estimators, while also mitigating known failure modes of flow-based methods.
☆ Quotient Semivalues for False-Name-Resistant Data Attribution
Data valuation methods allocate payments and audit training data's contribution to machine-learning pipelines; however, they often assume passive contributors. In reality, contributors can split datasets across pseudonymous identities, duplicate high-value examples, create near-duplicates, or launder synthetic variants to inflate their share. We formalize this as false-name manipulation in ML data attribution. Our main construction is the quotient semivalue mechanism: compute Shapley-, Banzhaf-, or Beta-style values over evidence-backed attribution clusters instead of raw identities, using a canonical-representative operator to absorb within-cluster duplication. We prove an impossibility: on a fixed monotone data-value game, exact Shapley-fair attribution over reported identities is incompatible with unrestricted false-name-proofness, even on binary-valued instances, and characterize the split-gain of a general semivalue on a unanimity counter-example. The mechanism is exactly false-name-proof under two structural conditions: false-name-neutral within-cluster allocation and quotient-stable manipulations. Under imperfect provenance, when these conditions hold approximately, manipulation gain and fairness loss are bounded by three measurable quantities: escaped-cluster mass, value-estimation error, and clustering distance. We instantiate the mechanisms in DataMarket-Gym, a benchmark for attribution under strategic provider attacks. On synthetic classification tasks, quotient semivalues with example-level evidence reduce manipulation gain on duplicate and near-duplicate Sybil attacks from $1.74$ under baseline Shapley to $0.96$, near the honest level. The cosine-threshold and (false-merge, false-split) rate sweeps trace the corresponding fairness--Sybil frontier.
☆ Direction-Preserving Number Representations
Low-precision number formats are widely used in modern machine learning systems due to their efficiency. Accurate direction representation is key to the accuracy of vector operations. This work precisely explores the extent to which the direction of a vector can be represented by selecting its scalar elements from a common finite alphabet of a given size. This is standard practice in machine learning, where low-precision significands may be narrow-width floating-point or integer values. A geometric framework is introduced for analyzing the directional coverage of such product-structured codes. This work analytically quantifies the suboptimality gap between such product-structured codes and spherical codes for the vector as a whole, in both low and asymptotically high dimensions. Furthermore, within the product code class, it is proven that the standard formats of two's complement, fixed-point, and floating-point are suboptimal, again with quantified gap, pointing to the potential to develop new scalar number formats. Such scalar alphabets are numerically optimized across multiple block dimensions for directional coverage, including the dimension used in NVIDIA's NVFP4 format. Experimental results are presented comparing the performance of standard formats and the optimized alphabet. We find that for four bits, NVIDIA's choice of E2M1 closely approximates the optimized alphabet, providing a geometric explanation for its strong performance in low-precision machine learning workloads and an analytical understanding of the link between that superiority and block size. We provide open-source formal proofs in Lean for the theorems in this work, along with the experimental code and the optimized alphabets obtained.
comment: 9 pages excluding appendices and references, 18 in total. 5 figures
☆ Stochastic Transition-Map Distillation for Fast Probabilistic Inference
Diffusion models achieve strong generation quality, diversity, and distribution coverage, but their performance often comes with expensive inference. In this work, we propose Stochastic Transition-Map Distillation (STMD), a teacher-free framework for accelerating diffusion model inference while preserving probabilistic sample generation. In contrast to score-based diffusion models, whose denoising parametrization models the mean of the posterior distribution, STMD distills the full transition map associated with the sampling stochastic differential equation (SDE). We parameterize these SDE transitions with a conditional Mean Flow model, yielding a one- or few-step stochastic sampler that retains the transition structure of the underlying diffusion process. This perspective is especially useful for downstream tasks that require stochastic inference, such as diffusion posterior sampling, inverse problems, and energy-based fine-tuning. Compared to recent distillation methods, STMD requires no pretrained teacher, bi-level optimization, or trajectory simulation and caching, enabling efficient and scalable training. We derive convergence bounds for our method in the Wasserstein distance, providing a strong theoretical foundation for our approach, and validate STMD on various image generation examples on the MNIST, CIFAR-10, and CelebA datasets.
☆ Reliable Chain-of-Thought via Prefix Consistency
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal, prefix consistency, that weights each candidate answer by how often it reappears under regeneration. It requires no access to token log-probabilities or self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and reweighting votes by it reaches Standard MV plateau accuracy at up to 21x fewer tokens (median 4.6x). Our code is available at https://github.com/naoto-iwase/prefix-consistency.
comment: See our project page at https://naoto-iwase.github.io/prefix-consistency-page
☆ Learning Large-Scale Modular Addition with an Auxiliary Modulus
Learning parity functions, more general modular addition, is a challenging machine learning task due to its input sensitivity. A recent study substantially scaled modular addition learning in both the number of summands and the modulus. Its key idea is to increase zeros in training sequences, reducing the effective number of summands and thus controlling training difficulty; however, this induces covariate shift between training and test input distributions. This study theoretically and empirically analyzes this side effect and proposes a covariate-shift-free method for modular addition. Specifically, we introduce an auxiliary modulus $Kq$ during training, which reduces wrap-around frequency and problem difficulty while preserving the same input distribution across training and testing. Experiments show strong scalability and sample efficiency: even for large input length $N$, large modulus $q$, and small datasets -- where the sparse method fails to learn -- our method achieves equal or better match accuracy and relaxed $τ$-accuracy. For example, at $N=64$ and $q=974269$, our method trained on 100K samples achieves $97.0\%$ $τ$-accuracy at $τ=0.05$, while the sparse method achieves only $9.5\%$ with the same data size and $93.9\%$ even when extended to 1M samples.
comment: 10+11 pages, 5 figures
☆ MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing
While explicit reasoning trajectories enhance model interpretability, existing paradigms often rely on monolithic chains that lack intermediate verification, allowing early errors to cascade unchecked. This lack of modularity impedes granular auditing and compromises the epistemic trust required for high-stakes applications. We propose MAVEN (Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing), a blackboard-inspired framework designed to transform LLMs into deliberate reasoners through explicit role-decoupling. At its core, MAVEN operationalizes an adversarial Skeptic-Researcher-Judge loop, simulating expert deliberation by functionally separating logical defense from factual grounding. Experiments on OpenBookQA, TruthfulQA, HALUEVAL and StrategyQA benchmarks demonstrate that MAVEN delivers superior reasoning quality across four fine-grained metrics. Notably, MAVEN consistently outperforms latent reasoning models such as GEMINI-3.1-Pro and consensus-based baselines (e.g., ReConcile) by generating explicitly structured, modular, and verifiable deliberation trajectories, rather than relying on implicit internal states or post-hoc consensus. Moreover, comprehensive evaluations confirm that MAVEN is fully model-agnostic, serving as a strong and transferable reasoning booster that yields substantial performance improvements across diverse backbone models.
comment: 24 pages, 2 figures
☆ Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single agent perspective) as a Dec-POMDP where at each time step an agent has to decide an action based on the local observation and typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present the Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms the existing learning-based MAPF solvers, including IL and RL-based approaches, across diverse metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.
☆ Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling
We consider a first order stochastic optimization framework where, at each iteration, $K$ independent identically distributed (i.i.d.) data point samples are drawn, based on which stochastic gradients can be queried. We allow gradient noise to be heavy-tailed, with possibly infinite variances. For the considered heavy-tailed setting, many algorithmic variants have recently been proposed based on gradient clipping or other nonlinear operators (e.g., normalization) applied over noisy gradients. In this paper, we take an alternative approach and propose a novel stochastic first order method dubbed Robust Stochastic Gradient Descent with medoid mini-batch gradient sampling, R-SGD-Mini for short. The core idea of R-SGD-Mini is to split the $K$-sized data batch into $M$ distinct data chunks, form for each chunk the stochastic gradient, and update the solution estimate with respect to the stochastic gradient direction of the chunk that is medoid of gradients of all data-chunks. Under a general class of symmetric heavy-tailed gradient noises and a standard non-convex setting, we establish explicit bounds on the expected time-averaged squared gradient norm. More precisely, we show that the latter quantity converges at rate $\mathcal{O}(T^{-1})$ to a small neighborhood of zero; we explicitly characterize this neighborhood in terms of noise and algorithm's parameters. Moreover, if the time horizon is known in advance, we establish the rate of $\mathcal{O}(T^{-\frac{1}{2}}).$ Furthermore, when clipping is incorporated, we obtain convergence guaranties in the high-probability sense and recover the same rate. Experimental results indicate that R-SGD-Mini and its clipped variant consistently perform favorably compared to SGD, clipped SGD and Median-of-Means based methods.
☆ Post-training makes large language models less human-like
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.
☆ Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
comment: work in progress
☆ Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
Recent methods for improving LLM mathematical reasoning, whether through MCTS-based test-time search or causal graph-guided knowledge injection, cannot identify which concepts causally contribute to a correct answer, as the observed association may be spurious, driven by confounders such as problem difficulty. We propose CIKA (Causal Intervention for Knowledge Activation), a framework that uses the LLM itself as an interventional simulator: a prompt sets the concept state to ``mastered'' and the correctness change estimates the causal effect. We formalize this quantity as an Interventional Capability Probe (ICP), which diagnoses whether the LLM can use a given concept -- distinct from merely possessing knowledge. Because the intervention exogenously sets the concept state independently of problem difficulty, ICP separates confounding that observational methods cannot. On 67 screened problems, the ICP of the top-ranked concept (+0.219) is significantly larger than that of the negative control (+0.039; paired $t$-test, $p < 10^{-6}$, Cohen's $d = 0.86$), confirming that the probe discriminates causally relevant concepts from irrelevant ones. Analysis of 601 Omni-MATH problems further shows that solved problems have 6.1$\times$ higher ATE than unsolved ones (0.338 vs. 0.055), confirming that ICP is predictive of problem-solving success. With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7\% on the contamination-free Omni-MATH-Rule benchmark and 64.0\% overall, compared to 60.5\% for o1-mini, and 97.2\% on GSM8K, 46--50\% on AIME 2024--2026, and 46.2\% on MathArena. The Causal Knowledge Activation component contributes 33.8\% of correct answers on problems where the base model alone fails, demonstrating that the LLM already possessed but had not activated the requisite knowledge.
comment: 17 pages, 0 figures
☆ Optimal Recourse Summaries via Bi-Objective Decision Tree Learning
Actionable Recourse provides individuals with actions they can take to change an unfavorable classifier outcome. While useful at the instance level, it is ill-suited for global auditing and bias detection, since aggregating local actions is costly and often inconsistent. Recourse Summaries address this limitation by partitioning the population and assigning one shared action per subgroup, enabling comparison across subgroups. Designing summaries involves a fundamental trade-off between recourse effectiveness and recourse cost, which existing methods do not adequately address. We introduce Summaries of Optimal and Global Actionable Recourse (SOGAR), which formulates recourse summary learning as an optimal decision tree learning problem and finds the Pareto front -- the complete set of solutions where improving one objective necessarily worsens the other. SOGAR enables post-hoc selection of the desired trade-off without retraining. Using shallow axis-parallel decision trees and sparse leaf actions, SOGAR produces stable, low-cost, and effective recourse summaries that outperform existing approaches across effectiveness and cost metrics.
☆ A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning ICML 2026
Contrastive Representation Learning (CRL) has achieved strong empirical success in multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are identically and independently distributed, an assumption violated in most practical settings where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using U-Statistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale in the order of $ρ_{\min}^{-{1}/{2}}$ where $ρ_{\min}$ denotes the probability of the rarest class. Such a dependency can be overly pessimistic in the extreme multiclass settings where there are many tail classes which contribute minimally to the overall population risk. Our contributions are two-fold. Firstly, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes $R$, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk \textit{across classes}, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is $\mathcal{O}(k)$ where $k$ is the number of samples per tuple.
comment: Accepted at ICML 2026
☆ Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
Transformer blocks typically combine multi-head attention (MHA) for token mixing with gated MLPs for token-wise feature transformation, yet many choices in their parameterization remain largely empirical. We introduce Causal Energy Minimization (CEM), a framework that recasts Transformer layers as optimization steps on conditional energy functions while explicitly accounting for layer parameterization. Extending prior energy-based interpretations of attention, CEM shows that weight-tied MHA can be derived as a gradient update on an interaction energy, and that a gated MLP with shared up/down projections can be viewed through an element-wise energy. This perspective identifies a design space for Transformer layers that includes within-layer weight sharing, diagonal-plus-low-rank interactions, lightweight preconditioners, and recursive updates. We evaluate CEM-derived layers in language-modeling experiments at the moderate hundred-million-parameter scale. Despite their constrained parameterizations, these layers train stably and can match corresponding Transformer baselines. Overall, our results suggest that CEM provides a useful lens for understanding Transformer layer parameterization, connecting Transformer architectures to energy-based models and motivating further exploration of energy-guided layer designs.
☆ Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the compute overhead of sampling costs for detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
☆ Bilevel Graph Structure Learning, Revisited: Inner-Channel Origins of the Reported Gain
Bilevel graph structure learning is widely understood to improve graph neural networks by jointly optimizing model parameters and a learned graph structure, with the resulting performance gain attributed to the rewired adjacency. We find that this attribution may be overstated: training-dynamics effects in the inner loop, rather than the rewiring itself, capture a substantial share of the gain. To establish this, we introduce frozen-$φ$, a control that freezes the graph while retaining the inner-loop training schedule. This decomposes the bilevel gain into an inner channel of $T$-step training dynamics with implicit gradient regularization and a graph channel of the graph rewiring itself. On spatio-temporal flow forecasting the inner channel matches or exceeds the full bilevel pipeline, accounting for 78-101% of the gain; on node classification it accounts for 37-44% under a Bernoulli edge-level parameterization. We also verify that classical spectral diagnostics can dissociate from task gain. We propose frozen-$φ$ as a standardized diagnostic for bilevel graph structure learning, with graph distillation as a method-agnostic complement. A three-precondition framework further predicts the sign of the bilevel gain on all six benchmarks.
☆ Ensemble Distributionally Robust Bayesian Optimisation
We study zeroth-order optimisation under context distributional uncertainty, a setting commonly tackled using Bayesian optimisation (BO). A prevailing strategy to make BO more robust to the complex and noisy nature of data is to employ an ensemble as the surrogate model, thereby mitigating the weaknesses of any single model. In this study, we propose a novel algorithm for Ensemble Distributionally Robust Bayesian Optimisation that remains computationally tractable while managing continuous context. We obtain theoretical sublinear regret bounds, improving current state-of-the-art results. We show that our method's empirical behaviour aligns with its theoretical guarantees.
☆ Beyond Distribution Estimation: Simplex Anchored Structural Inference Towards Universal Semi-Supervised Learning ICML 2026
Semi-supervised learning faces significant challenges in realistic scenarios where labeled data is scarce and unlabeled data follows unknown, arbitrary distributions. We formalize this critical yet under-explored paradigm as Universal Semi-supervised Learning (UniSSL). Existing methods typically leverage unlabeled data via pseudo-labeling. However, they often rely on the idealized assumption of a uniform unlabeled data distribution or require sufficient labeled data to estimate it. In the UniSSL setting, such dependencies lead to numerous erroneous pseudo-labels, thereby triggering representation confusion. Fortunately, we observe that inter-sample relations captured by representations are more reliable than pseudo-labels. Leveraging this insight, we shift our focus to representation-level structural inference to bypass distribution estimation. Accordingly, we propose Simplex Anchored Graph-state Equipartition (SAGE), which captures high-order inter-sample dependencies to establish structural consensus for guiding representation learning. Meanwhile, to mitigate representation confusion, we employ vectors that satisfy a simplex equiangular tight frame to serve as a coordinate frame for guiding inter-class representation separation. Finally, we introduce a weighting strategy based on distribution-agnostic metrics to prioritize reliable pseudo-labels and an auxiliary branch to isolate potentially erroneous pseudo-labels. Evaluations on five standard benchmarks show that SAGE consistently outperforms state-of-the-art methods, with an average accuracy gain of \textbf{8.52\%}.
comment: The paper is accepted by ICML 2026
☆ ProteinJEPA: Latent prediction complements protein language models
Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement these token-level objectives under matched wall-clock budget. Across pretrained and random-init protein sequence encoders at 35--150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a variant: predicting latent targets only at masked positions, and retaining the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M, 11/2/3 on ESM2-150M while results in pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, \b{eta}β\b{eta}-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. Tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive and can outperform pure MLM in pretraining and continued training, even under matched wall-clock budgets.
☆ Disagreement-Regularized Importance Sampling for Adversarial Label Corruption
Standard Importance Sampling (IS) collapses under label corruption because high-norm examples, prioritized for variance reduction, are often adversarial outliers. We formalize this misalignment using an $\varepsilon$-contamination model and propose Disagreement-Regularized Importance Sampling (DR-IS), a sub-sampling method based on loss rank-disagreement across independent proxy ensemble. We prove finite-sample concentration bounds showing that the empirical rank disagreement of bulk corrupted examples is bounded above, and that of boundary-clean examples bounded below, both at rate $O(\sqrt{\log(N/δ)/K})$ with probability $1-δ$; when the structural expectation gap $Δ'$ between the two groups is positive and the boundary-clean set is at least as large as the selected subset, these bounds certify strict separation and control the contamination rate of the selected subset. Empirically, DR-IS remains robust under targeted high-norm attacks that break magnitude-based methods such as the Error $L_2$-norm (EL2N) on benchmark datasets. DR-IS complements training-dynamics approaches like Area Under the Margin ranking (AUM), offering improved robustness in the loss-aligned regime alongside explicit finite-sample concentration certificates and a contamination bound limiting noise leakage from the statistical tail of corrupted points.
☆ Probabilistic Object Detection with Conformal Prediction
Conformal Prediction (CP) is a distribution-free method for constructing prediction sets with marginal finite-sample coverage guarantees, making it a suitable framework for reliable uncertainty quantification in safety-critical object detection. However, object detection introduces structured multi-output predictions, complicating the application of classical CP theory developed for single outputs. In addition, standard, unscaled CP produces fixed-width prediction intervals across inputs, leading to unnecessary width for low-uncertainty predictions. While scaled CP addresses this by adapting the interval width to an input-dependent uncertainty estimate, prior work has neither systematically compared unscaled and scaled CP for multi-class object detection, nor integrated CP with a complementary uncertainty quantification method in this setting. We fill this gap by: (i) applying CP coordinate-wise to bounding box corners with a Bonferroni correction for box-level guarantees; (ii) scaling the resulting intervals using per-prediction aleatoric uncertainty estimates derived from a probabilistic object detector trained with loss attenuation, evaluated in uncalibrated and two calibrated variants; (iii) extending to a two-step pipeline that constructs prediction sets for the class using RAPS and conditions the conformalized bounding boxes on the predicted class set. Across three autonomous driving datasets (KITTI, BDD, CODA), including a cross-domain setting under distribution shift, scaled CP consistently improves interval sharpness over unscaled CP, achieving up to 19% higher IoU and 39% lower interval scores, without sacrificing coverage. Class-wise calibration further improves coverage for both variants with a negligible effect on sharpness. Together, these improvements yield more actionable uncertainty estimates for real-time, real-world object detection.
comment: Code is available at https://github.com/mos-ks/OD-CP
☆ On the Invariance and Generality of Neural Scaling Laws
Neural scaling laws establish a predictable relationship between model performance and data or compute, offering crucial guidance for resource allocation in new domains and tasks. Yet such laws are most needed precisely where they are hardest to obtain: fitting one for a new model task pair demands expensive sweeps that typically exhaust the very compute budget the law is meant to economize. This paper poses the research question of how to develop generalizable scaling laws: laws fit once on a well-resourced source domain and reliably transported to new domains where running a full sweep is infeasible, which requires a fundamental understanding of when and why scaling properties change. We address this by identifying the right invariants: scaling laws are preserved under bijective (information-preserving) transformations of the data and modified in predictable, information-theoretically grounded ways under non-bijective transformations that lower its information resolution $ρ$: a single axis along which a law fit in one domain can be transported to another. We validate this across language, vision, and speech, and demonstrate two cross-domain applications: predicting scaling for language models trained on electronic health records from laws fit on general text, and predicting time-series classification scaling under varying levels of noise injection, recovering the data-scaling exponents to within $3\%$ error.
comment: 23 pages, 6 figures, 11 tables
☆ GESR: Graph-Based Edge Semantic Reconstruction for Stealthy Communication Detection with Benign-Only Training
Detecting stealthy malicious communications from flow logs under benign-only training remains a critical challenge in network security. Malicious communications often camouflage as normal traffic like standard HTTPS flows. Conventional intrusion detectors rely strictly on known labeled attacks. Alternatively, they score flows completely independently. These approaches fail against sparse and context-dependent suspicious activity. To capture this essential context, graph anomaly detectors have been introduced to add valuable relational information to the analysis. However, existing methods fail to test the structural consistency of specific communication edges. To overcome these fundamental limitations, we present GESR, a novel graph-based framework for detecting suspicious communications and anomalous hosts under a benign-only training setting. GESR models complex network activity as attributed communication graphs. It cleverly reconstructs edge semantics entirely from local structural context rather than isolated features. This non-intuitive design forces the framework to predict expected communication patterns from neighborhood topologies. Attackers cannot easily manipulate this deep structural dependency. The model then converts the resulting structural inconsistencies into host-level anomaly scores. It utilizes robust Median Absolute Deviation (MAD) calibration for this final step. We evaluate GESR extensively on CTU-13 and CICIDS2017 datasets. These evaluations strictly impose tight false-positive operating constraints. On CICIDS2017, GESR achieves an outstanding ROC-AUC of 0.9753. It also yields a high TPR of 0.8569 at a strict 5% FPR threshold. GESR consistently outperforms existing methods across both evaluated benchmarks. The results prove that structure-conditioned edge reconstruction is a credible direction for practical intrusion detection.
☆ SGD for Variational Inference: Tackling Unbounded Variance via Preconditioning and Dynamic Batching
Black-Box Variational Inference (BBVI) typically relies on Stochastic Gradient Descent (SGD) to optimize the Evidence Lower Bound (ELBO). However, the stochastic gradients in BBVI inherently exhibit unbounded variance, violating standard assumptions and instead satisfying the weaker Blum-Gladyshev (BG) condition, where variance grows quadratically with distance from the optimum. In this paper, we bridge the gap between stochastic optimization theory and the practical instances of BBVI. Focusing on the broad elliptic location-scale family of parameterized distributions, we offer two main contributions. First, we prove the existence of an ELBO solution, a foundational property usually assumed a priori in the literature. Second, we establish comprehensive convergence guarantees spanning finite-time and asymptotic regimes for Minibatch Projected SGD (PSGD) equipped with dynamic batching and preconditioning under the BG condition. Our theoretical framework demonstrates that dynamic batching combined with preconditioning systematically enables rigorous guarantees even in complex settings. We illustrate our theoretical findings with numerical results, highlighting the efficacy of our approach for modern inference tasks.
☆ Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It
Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4--6\% computational overhead in practice.
☆ Tessellations of Semi-Discrete Flow Matching
We study Flow Matching in a semi-discrete setting where a Gaussian source is transported toward a discrete target supported on finitely many points. This semi-discrete regime is the theoretical setting behind the use of Flow Matching for generative modeling, where the target distribution is represented by a finite dataset. In this semi-discrete regime, the exact Flow Matching velocity field is available in closed form, which makes it possible to analyze the geometry induced by the terminal flow map independently of optimization and approximation effects. We investigate the terminal assignment regions, namely the preimages of the target atoms under the terminal flow. We show that these regions are open, simply connected and, under an additional assumption, homeomorphic to the unit ball. At the same time, a planar four-point example shows that these cells can differ sharply from Laguerre cells arising in semi-discrete optimal transport: they may be non-convex, have curved boundaries, and exhibit different boundedness and adjacency patterns. These results clarify the geometry intrinsically induced by the exact semi-discrete Flow Matching objective before neural approximation enters the picture.
☆ LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supervised Fine-Tuning (SFT) for small-scale models often leads to overfitting, catastrophic forgetting and policy rigidity, and thus fails to fully address these challenges. In this work, we propose a novel SFT-free training paradigm that significantly enhances the performance of small-scale models. We first present the initial systematic integration of generalized knowledge distillation into the GUI agent domain via Guided On-policy Distillation. By incorporating oracle reference trajectories together with a dynamic retrieval mechanism, our method reduces hallucinations and mitigates the cognitive misalignment inherent in multi-solution GUI tasks. Building on this foundation, we further introduce a Multi-solution Dual-level GRPO framework that jointly aligns macro-level subtask planning with micro-level execution matching, thereby improving exploration in long-horizon GUI agent scenarios. In addition, we construct an automated data generation pipeline to synthesize GUI task trajectories with rich multi-solution annotations. Extensive experiments show that our method achieves state-of-the-art performance among lightweight models while remaining competitive with substantially larger-scale models across all benchmarks. Ablation studies further demonstrate that structured on-policy distillation and multi-solution dual-level exploration can fully unlock the capabilities of 2B/3B scale agents, surpassing the performance limits of conventional imitation learning.
☆ ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose \textbf{ExpThink}\xspace, an RL framework that addresses both dimensions through two complementary mechanisms. First, \emph{experience-guided reward shaping} tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, \emph{difficulty-adaptive advantage} replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that \textbf{ExpThink}\xspace reduces average response length by up to 77\% while simultaneously improving accuracy, achieving up to $3\times$ higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
comment: 39 pages, 18 figures. Code and model checkpoints will be released upon publication
☆ Efficient Data Selection for Multimodal Models via Incremental Optimization Utility
The scaling of Large Multimodal Models (LMMs) is constrained by the quality-quantity trade-off inherent in synthetic data. Previous approaches, such as LLM-as-a-Judge, have proven their effectiveness in addressing this but suffer from prohibitive computational costs and lack of interpretability. To bridge this gap, we propose One-Step-Train (OST), a framework that reformulates data selection as an incremental optimization utility ranking problem. Instead of relying on semantic heuristics, OST estimates the marginal utility of each sample via a simulated single-step update on a lightweight proxy. Experiments on the Qwen series across multimodal mathematical reasoning benchmarks demonstrate that OST achieves Pareto-optimal efficiency. By selecting the top-50 subset, OST reduces training costs by 43% (and total time consumption by 17) while surpassing the strong LLM-as-a-Judge baseline by 1.8 points. Furthermore, under a fixed compute budget, our method using only the top-20 subset achieves a 5.6 point gain over LLM-as-a-Judge, improves upon heuristic scoring baselines like DEITA, and outperforms the Full-SFT baseline by 8.8 points. Notably, while Full-SFT suffers from performance degradation due to noise, our optimization-grounded approach effectively identifies toxic samples, successfully reversing the negative transfer frequently observed in complex reasoning tasks.
☆ Excluding the Target Domain Improves Extrapolation: Deconfounded Hierarchical Physics Constraints
Extrapolation to out-of-distribution conditions is a fundamental challenge for physics-constrained deep generative models. Existing methods apply physical constraints as a single static regularization term uniformly across the generation process, and address neither the hierarchical structure of physical laws and the confounding variable problem. We propose the Deconfounded Hierarchical Gate (DHG), which serves as a diagnostic and control mechanism: it identifies when and how strongly temperature confounding contaminates each constraint level, so that hierarchical gates reflect intrinsic physical inconsistency rather than spurious temperature effects. DHG combines counterfactual estimation via the do-operator with backdoor adjustment to remove confounding, then applies Coarse-to-Fine physical constraints progressively. We report a counter-intuitive finding in pretraining: excluding the target-domain data from pretraining outperforms including it by 39% in extrapolation performance (RMSE 0.224 vs. 0.324). This occurs because FNO learns domain-agnostic physical patterns that transfer more effectively when the target domain is withheld. On a lithium-ion battery temperature extrapolation benchmark (trained at 24 degrees Celsius, evaluated at 4.0--43.0 degrees Celsius), our method achieves RMSE = 0.215, a 46% improvement over the unconstrained baseline (Pure CFM: 0.397).
comment: 16 pages, 2 figures
☆ Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization
Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(\varphi, ψ, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
☆ SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion
Machine unlearning for large language models (LLMs) aims to selectively remove memorized content such as private data, copyrighted text, or hazardous knowledge, without costly full retraining. Most existing methods require a retain set of curated examples to prevent catastrophic degradation of general model utility, creating an extra data dependency that complicates deployment. We propose SHRED (Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion), a retain-set-free unlearning method built on a key insight: not all tokens within a forget set instance carry memorized information equally. High-information tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED operates in two stages. (1) Selection: We perform a forward pass on a forget set instance, collect per-token autoregressive probabilities, and select the bottom (lowest probability, highest Shannon information) as forget positions; the remaining positions are retained as benign anchors. (2) Training: We construct modified KL targets that demote the memorized token's logit at forget positions while preserving the original distribution at benign positions. The model is then trained via a single top KL self-distillation objective that simultaneously drives forgetting and utility preservation. We evaluate SHRED across four standard unlearning benchmarks and demonstrate that it establishes a new Pareto-optimal trade-off between forget efficacy and model utility, outperforming retain-set-dependent methods. Our analysis shows that SHRED is robust against relearning attacks and membership-inference attacks, and it maintains stable utility even after many sequential unlearning runs.
☆ NPMixer: Hierarchical Neighboring Patch Mixing for Time Series Forecasting
Multivariate time series forecasting remains a challenge due to the complexity of local temporal dynamics and global dependencies across multiple variables. In this paper, we propose \textbf{N}eighboring \textbf{P}atching \textbf{Mixer} (\textbf{NPMixer}), a hierarchical architecture featuring a Learnable Stationary Wavelet Transform that adaptively learns filter coefficients to decompose signals into trend and detail components in a data-dependent manner. Our framework introduces a Neighboring Mixer Block that captures local temporal dynamics through a series of hierarchical MLP layers operating on non-overlapping patches. Specifically, the mixer block utilizes MLPs to learn temporal patterns within and across these patches, expanding the receptive field to capture multi-scale dependencies. A Channel-Mixing Encoder is applied to high-frequency components to learn channel correlations while preserving the stability of the underlying global trend. Extensive experiments on seven benchmark datasets demonstrate that NPMixer consistently outperforms state-of-the-art models, achieving better performance in 20 out of 28 ($71.4\%$) evaluated experimental setups for MSE.
☆ Breaking QAOA's Fixed Target Hamiltonian Barrier: A Fully Connected Quantum Boltzmann Machine via Bilevel Optimization
To overcome the limitations of classical partially connected Boltzmann machines and mainstream quantum Boltzmann machines (QBMs), this work extends the conventional circuit of the quantum approximate optimization algorithm (QAOA) to a bilevel optimization architecture and proposes a fully connected QBM. The inner-loop training simulates positive phase energy minimization based on the computational process of the conventional QAOA circuit, whereas the outer-loop training simulates negative phase contrastive divergence learning by optimizing the structural parameters of the target Hamiltonian. It is found that, first, the model exhibits superior performance using only a single layer (p=1) in the QAOA circuit, with an average probability of 0.9559 in measuring the target quantum state under noiseless conditions. Second, the model exhibits notable noise robustness. Under the typical noise level of current mainstream commercial quantum computing devices, the average probability of measuring the target quantum state reaches 0.6047; when the noise rises to a more stringent level with doubled intensity, this probability remains at 0.3859. In both scenarios, the target quantum state maintains the highest measurement probability among all detected states, with a value several times higher than that of the second-ranked state. This indicates that the model retains strong robustness even when noise meets or exceeds the upper limit of current mainstream commercial quantum computing devices. Third, under a block-by-block learning strategy with p=1 and only 10 measurement shots, the model consistently generates the target "qubit" grid image regardless of noise interference, demonstrating strong robustness in image generation.
comment: 34 pages, 8 figures, 3 tables, 1 algorithm
☆ Transfer Learning Across Fast- and Full-Simulation Domains in High-Energy Physics
Machine-learning models in high-energy physics are often trained on simulated data, where fully simulated samples are computationally expensive while fast simulation provides large statistics at reduced realism. In this work, we systematically study transfer learning between fast-simulated and fully simulated datasets in a realistic LHC environment. We consider three representative tasks, signal-background classification, quark-gluon jet tagging, and missing transverse energy reconstruction, using dense neural networks, graph neural networks, and transformer-based architectures. Models are pretrained on ATLAS-like fast simulation and adapted to CMS-like fast simulation and to fully simulated ATLAS Open Data. Across all tasks, pretrained models consistently outperform independently trained baselines and require significantly less target-domain training data, typically reducing the needed statistics by about a factor of two. These results demonstrate that fast simulation can be used to learn robust, reusable representations and motivate publishing trained models as reusable scientific assets beyond large foundation models.
comment: 16 pages, 8 figures
☆ Uncovering Hidden Systematics in Neural Network Models for High Energy Physics
Neural networks (NNs) are inherently multidimensional classifiers that learn complex, non-linear relationships among input observables. While their flexibility enables unprecedented performance in high-energy physics (HEP) analyses, it also makes them sensitive to small variations in their inputs. Consequently, the propagation and estimation of systematic uncertainties in NN-based models remain an open challenge. There are indications that uncertainties derived in control regions or from nominal variations of input features can underestimate the true model uncertainty, potentially leaving biases unaccounted for. Inspired by insights from adversarial-attack studies in machine learning, we explore how subtle perturbations, fully consistent with the experimental uncertainties on the input observables, can lead to substantial changes in NN outputs, while keeping the one-dimensional and correlated input distributions nearly unchanged. Using a set of representative HEP tasks, including event classification and object identification, and testing across a variety of network architectures, we demonstrate that networks can be systematically "fooled" at significant rates within the allowed uncertainty envelopes. Building on this observation, we introduce a quantitative framework to probe and measure the hidden sensitivity of neural networks to realistic experimental variations, providing a practical path to evaluate and control their systematic uncertainty in physics analyses.
comment: 18 pages, 9 figures
☆ Physical Simulators as Do-Operators: Causal Discovery under Latent Confounders for AI-for-Science
Existing interventional causal discovery methods -- IGSP, DCDI, ENCO -- assume causal sufficiency (no latent confounders) and rely on virtual interventions in synthetic simulators. In AI-for-Science settings such as molecular design and materials science, latent confounders are ubiquitous and real interventions (e.g., physics-based simulations) require hours to days per data point. We propose CFM-SD (Causal Flow Matching with Simulation Data), which uses first-principles physical simulators as do-operators in Pearl's interventional calculus to simultaneously handle latent confounders and real interventional data. Theoretically, $d$-variable causal structure is identifiable with $O(d)$ single-variable interventions -- the minimum under physical realizability constraints. In Intrinsic Evaluation on synthetic data ($γ=0.2$--$0.8$), CFM-SD achieves average F1$=0.800$ vs. F1$=0.127$--$0.562$ for all baselines. In Extrinsic Evaluation on real scientific data, CFM-SD achieves 57--58\% bias reduction in molecular toxicity prediction and battery electrolyte optimization, demonstrating practical value beyond synthetic benchmarks.
comment: 17 pages, 1 figure
☆ Approximation Error Upper and Lower Bounds for Hölder Class with Transformers ICML2026
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/α})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $α\in(0,1]$ under any accuracy $\varepsilon>0$. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least $Ω(\varepsilon^{-{d_{0}}/({4α})})$ blocks to achieve the $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.
comment: 31 pages, 2 figures. Accepted by ICML2026
☆ Learning Minimal-Deviation Corrections for Multi-Dimensional Mismodelling in HEP Simulations
Accurate Monte Carlo (MC) modelling in high-energy physics is challenging, particularly in complex scenarios where simulations fail to reproduce observed data. In practice, experimental information is often limited to one-dimensional (1D) distributions, while mismodelling arises in a multidimensional feature space. This restricts traditional correction methods, as one-dimensional reweighting ignores correlations and fully multidimensional approaches require large target datasets. We propose a neural network-based method that operates under these constraints by learning a transformation of simulated events that reproduces the available 1D target distributions while remaining close to the original simulation. This minimal-deviation principle preserves the global correlation structure of the baseline model while enabling targeted corrections of mismodelled features. Using controlled studies with simulated pseudo-data, we show that the method improves agreement with target distributions and maintains a consistent multidimensional structure. The approach is designed for complex, high-dimensional analyses where traditional techniques are insufficient, providing a scalable way to enhance MC modelling under limited information.
comment: 12 pages, 6 figures
☆ Estimation of Motor Unit Parameters from Surface Electromyograms using an Informed Autoencoder
Motor unit parameters such as the innervation zone centre or the conduction velocity of the electrical potential harbour the potential to improve the fidelity of neuromechanical models used for movement and force prediction. Determining these parameters in a non-invasive way is challenging, as they are subject-specific and may vary with muscle contraction. Existing work on the estimation of motor unit parameters mainly relies on white-box modelling and therefore requires substantial manual modelling effort. This work targets the simultaneous estimation of multiple subject-specific motor unit parameters from electromyography (EMG) recordings measured non-invasively at the skin surface. This results in an inverse problem with a nonlinear loss function. To address this problem, an informed autoencoder is developed. This autoencoder reconstructs the surface EMG recordings while learning the parameters in its latent space and adhering to physical laws that relate the parameters to the EMG signals. In experiments on synthetic data, innervation zone centres are estimated with a mean absolute error of 2.5989 $\mathrm{mm}$, and conduction velocities of the electric potential are estimated with a mean absolute error of 0.1697 $\mathrm{m}\mathrm{s}^{-1}$. These results demonstrate the plausibility of this novel approach, which enables the simultaneous estimation of several motor unit parameters while reducing manual modelling effort through the integration of data-driven machine learning.
☆ Inference-Time Attribute Distribution Alignment for Unconditional Diffusion
Inference-time controllable generation is essential for real-world applications of unconditional diffusion models. However, most existing techniques focus on individual samples, struggling in applications that require the sample population to follow specific attribute distributions (e.g., demographic balance or semantic proportions). We formalize this setting as the inference-time attribute distributional alignment problem for pretrained unconditional diffusion models. To address this, we cast inference-time attribute distributional alignment as an optimal control problem over the reverse diffusion process, viewing the process as the rollout of a dynamical system and augmenting it with additive, time-dependent perturbations as control. We solve for the perturbations using an optimal-control-based algorithm to optimize a differentiable distribution-matching objective while penalizing control effort to preserve data fidelity. Experiment results in image generation demonstrate that our proposed plug-and-play approach can better align attribute distributions to diverse and flexible test-time targets compared to baselines, without retraining or finetuning the pretrained diffusion model.
comment: Preprint. 35 pages, 13 figures
☆ VNN-LIB 2.0: Rigorous Foundations for Neural Network Verification
Neural network verification is an active and rapidly maturing research area, with a growing ecosystem of solvers and tools. The VNN-LIB standard was introduced to support interoperability in this ecosystem, but Version~1.0 has several serious short-comings as a formal foundation: it lacks a precise syntax, semantics, and type system, offers limited expressivity, and relies on externally defined ONNX models whose semantics are informal and constantly evolving. The latter distinguishes VNN-LIB from established standards such as SMT-LIB, where queries are self-contained and have fixed semantics. In this paper we address these challenges by developing the theoretical foundations of VNN-LIB~2.0. Our key contribution is the introduction of the notion of a \emph{network theory}, which abstractly characterises the minimal semantic interface required from a neural network model format. This abstraction enables VNN-LIB to be defined independently of any specific ONNX version while remaining compatible with evolving model representations. Building on this foundation, we present a formal syntax for a more expressive query language, a type system for it over the numeric domains provided by the network theory, and finally a formal semantics. To ensure internal consistency, the standard is mechanised in the Agda theorem prover. VNN-LIB~2.0 therefore provides robust and rigorous foundations for trustworthy neural network verification.
☆ Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
☆ SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysis
We report the construction of a Korean evaluation-annotated corpus, hereafter called 'Evaluation Annotated Dataset (EVAD)', and its use in Aspect-Based Sentiment Analysis (ABSA) extended in order to cover e-commerce reviews containing sentiment and non-sentiment linguistic patterns. The annotation process uses Semi-Automatic Symbolic Propagation (SSP). We built extensive linguistic resources formalized as a Finite-State Transducer (FST) to annotate corpora with detailed ABSA components in the fashion e-commerce domain. The ABSA approach is extended, in order to analyze user opinions more accurately and extract more detailed features of targets, by including aspect values in addition to topics and aspects, and by classifying aspectvalue pairs depending whether values are unary, binary, or multiple. For evaluation, the KoBERT and KcBERT models are trained on the annotated dataset, showing robust performances of F1 0.88 and F1 0.90, respectively, on recognition of aspect-value pairs.
☆ GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection
LLM-based game generation promises to turn natural-language specifications into executable games, but progress is limited by the lack of reliable automated verification. Unlike conventional code generation, game correctness is defined over long-horizon interaction: a game may appear correct while violating core mechanics such as state updates, interaction rules, and phase transitions. Existing Agent-as-a-Verifier approaches collapse verification into open-ended gameplay, making verdicts reachability-bound, time-consuming, coverage-limited, and sensitive to the agent's gameplay ability. We present GameGen-Verifier, an automated verification paradigm for LLM-generated games that decomposes a specification into verifiable keypoints and grounds them into independent verification units. Each unit patches the game runtime into a concrete target state, executes a bounded interaction, and judges the outcome against the keypoint assertion. We implement GGV-Harness, a scalable agentic harness providing concurrency management, runtime isolation, and fault recovery. On VeriGame, our dataset of 100 games across seven genres, GameGen-Verifier achieves up to 92.2% accuracy against human judgments versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline, while reducing wall-clock time by up to 16.6x.
☆ Inference of Qualitative Models from Steady-State Data via Weighted MaxSMT
Qualitative models provide crucial instruments for modelling complex biological systems. While advances in automated reasoning and symbolic encodings have enabled rigorous inference of these models from data, the process remains highly fragile. First, biological measurement errors inevitably propagate into formal model specifications. Second, when a specification becomes unsatisfiable, distinguishing between fundamental design flaws and minor technical errors is notoriously difficult. This uncertainty often leads to under-specification, as it is unclear which observations are still ``safe'' to incorporate. To overcome these challenges, we introduce a robust inference method based on weighted MaxSMT. By encoding uncertain biological observations as weighted soft constraints, our approach enables the solver to identify a model best reflecting the observations, even with some conflicting constraints. Our method allows for Boolean and multi-valued variable domains, alongside observations derived from discretisation (level constraints) and differential expression (ordering constraints). We show our approach can be used to successfully infer neural cell differentiation models from prior-knowledge networks with 200--1300 genes using ordering constraints on all included genes.
☆ Generating training datasets for legal chatbots in Korean
Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
☆ A Flexible Adaptive Stable Clustering Algorithm for Archive-Scale Online Mass Spectrometry
Modern online mass spectrometry generates multi-terabyte data streams critical for understanding Earth's environmental systems. However, extracting actionable chemical insights from these repositories is impeded by a computational bottleneck: existing clustering methods force a compromise among scalability, metric flexibility, and algorithmic stability. Here, we introduce Flexible Adaptive Stable Clustering (FASC), a dynamical systems framework that resolves these constraints by architecturally decoupling the similarity kernel from rigorous optimization logic. Unlike legacy heuristics that suffer from stochastic drift and algorithmic blending, FASC employs a Density-Augmented Similarity Selection rule and geometric constraints to guarantee deterministic, order-independent convergence. After validating FASC on canonical machine-learning ground truths (achieving >99.5% cluster purity and 0.99 Adjusted Rand Index), we deployed the framework on 25 million mass spectra of atmospheric aerosols. Demonstrating strictly linear empirical runtime scaling (O(N)), FASC autonomously mapped atmospheric aging pathways of secondary inorganic aerosols while isolating ultra-rare industrial tracers (<0.2% abundance), providing a scalable infrastructure for mining environmental big data.
☆ SR$^2$-LoRA: Self-Rectifying Inter-layer Relations in Low-Rank Adaptation for Class-Incremental Learning
Pre-trained models with parameter-efficient fine-tuning (PEFT) have demonstrated promising potential for class-incremental learning (CIL), yet catastrophic forgetting still persists when adapting models to new tasks. In this paper, we present a novel perspective on catastrophic forgetting through the analysis of inter-layer relation drift, i.e., the progressive disruption of relationships among layer-wise representations during the learning of new tasks. We theoretically show that the increase of such drift reduces the classification margins of previously learned tasks, thereby degrading overall model performance. To address this issue, we propose \underline{S}elf-\underline{R}ectifying inter-layer \underline{R}elation Low-Rank Adaptation~(SR$^2$-LoRA), a simple yet effective method that mitigates catastrophic forgetting by constraining inter-layer relation drift. Specifically, SR$^2$-LoRA constructs the relation matrices induced by the previous and current models on current-task samples, and aligns the corresponding singular values. We further theoretically show that this alignment exhibits greater robustness to estimation perturbations than direct entry-wise alignment. Extensive experiments on standard CIL benchmarks demonstrate that SR$^2$-LoRA effectively mitigates catastrophic forgetting, with its advantages becoming more pronounced as the number of tasks increases. Code is available in the \href{https://github.com/FqWan24/SR-2-LoRA}{repository}.
☆ Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs IEEE
Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality strongly depends on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, mostly outperforming conventional Single Error Detection Double Error Correction (SECDED) ECC schemes, with no memory overhead and, in fact, with considerably lower area and delay characteristics when compared to SECDEC. Experimental results indicate that ViTs can be effectively protected by merely protecting their highest exponent bits in FP16 and FP32 representations. Furthermore, applying the CEP technique can guarantee the resilience of DNNs by up to one order of magnitude higher BERs, with a 3.5x lower area overhead and 7x faster decoder compared to SECDED ECC.
comment: 7 pages, 7 figures, 3 tables. The paper is accepted at IEEE IOLTS'26
☆ Risk-Consistent Multiclass Learning from Random Label-Subset Membership Queries
Obtaining accurate class labels is often costly or unreliable, and may also be limited by privacy or other practical conditions. Compared with asking an annotator to provide the exact class, it is often easier to ask whether the true label belongs to a certain label subset. This query-response form defines a distinct weak-supervision mechanism: weak supervision information is generated through feedback on a label subset. Although weakly supervised learning has studied many learning frameworks, most existing work starts from established weak label objects. A systematic characterization is still lacking for weakly supervised learning generated directly by such query response observations. This paper proposes a multiclass learn ing framework under random label-subset queries. We model the data-generating distribution of query-response observations and derive an unbiased estimator of the target risk under the empirical risk minimization (ERM) framework. To address negative empirical risk and the associated overfitting problem, we introduce corrected risk estimators based on non-negative and absolute-value corrections. Theoretical analysis establishes a conditional generalization and excess-risk bound for the unbiased estimator, and a bias-and-consistency result for the corrected risk estimator. Experiments under the matched random-query mechanism demonstrate the feasibility of direct query-response learning and the stabilization effect of risk correction.
☆ Tracking Large-scale Shared Bikes with Inertial Motion Learning in GNSS Blocked Environments IEEE
Although Global Navigation Satellite Systems (GNSS) provide a general solution for bike tracking outdoors, there still exist complex riding environments where only inertial navigation systems work, such as urban canyons. Despite decades of research, localization using only low-cost inertial sensors still faces challenges such as cumulative drifts and poor robustness caused by filtering methods. Furthermore, sensors such as visual and LiDAR could provide reliable measurements, but they are not suitable for large-scale deployment. In this paper, we propose an inertial tracking framework that integrates bicycle mechanical constraints with a mixture-of-experts model. Specifically, we leverage multiple expert modules to capture shared representations and weight them through the gating mechanism, thus improving multi-task learning performance and enabling uncertainty-aware trajectory estimation. Furthermore, based on the mechanical transmission between the pedal and the rear wheel of a bike, we explore the intrinsic relationship between the rider's periodic pedalling behaviors and acceleration variations, and convert such patterns into bike's wheel speed for dynamic calibration. Experiments with real-world riding data from shared bikes of the DiDi ride-hailing platform demonstrate that our system improves the accuracy of baselines by at least 12%, with wheel speed errors below 0.5 m/s at 95-percentile.
comment: It has been submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). Journal article. 14 pages, 18 figures, 10 tables
☆ The Proxy Presumption: From Semantic Embeddings to Valid Social Measures ACL 2026
Natural Language Processing is rapidly evolving into a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the ''Proxy Presumption,'' or the reliance on geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) like topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method using LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite -- including tests for discriminant, incremental, and predictive validity -- this work offers the community a toolkit to transform heuristic proxies into robust, scientifically defensible instruments.
comment: ACL 2026
☆ Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer ICML
Health foundation models (FMs) learn useful representations from wearable sensors, but interpreting what they encode and transferring that knowledge across modalities after training remains difficult. We present a post-training framework that decomposes frozen embeddings into interpretable directions, referred to as symbols, and use these symbols to align the embedding spaces without retraining. We evaluate the framework on three FMs for photoplethysmography (PPG) and accelerometer data, independently pretrained on ~20M minutes of unlabeled data from ~172K participants, and analyzed on a held-out cohort of 30K subjects. We find that extracted symbols associate selectively with health conditions and physiological attributes, and these associations are partially shared across modalities and architectures. Cross-modal transfer via symbols retains more than 95% of in-domain performance, is nearly symmetric across domain directions, and saturates with limited paired data, together indicating that alignment recovers a shared low-dimensional subspace rich in physiological information. Overall, these results suggest that health FM embeddings contain an interpretable symbolic organization that is shared across modalities and supports cross-domain transfer without joint training.
comment: 8 pages ICML workshop, 4 main figures
☆ Have Graph -- Will Lift? The Case for Higher-Order Benchmarks
After a somewhat rocky start, geometry and topology have established a foothold in machine learning. Message passing, either on graphs or higher-order complexes, is one of the main drivers of geometric deep learning, and paradigms that were once considered to be firmly in the realm of the abstract-like sheaves-have been "tamed" to serve as novel inductive biases for model architectures in topological deep learning. The veritable diversity of models, however, is in stark contrast to the scarcity of suitable benchmark datasets. As a result, researchers often resort to lifting existing graph datasets to include higher-order information. In this opinion paper, I want to encourage the community to also source new datasets, which may be used to prop up the foundations of our research field.
☆ Rubric-based On-policy Distillation
On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms the advanced logit-based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.
comment: Preprint. Code is available at https://github.com/Peregrine123/ROPD_official
☆ Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts
Efficient routing across multiple LLMs enables cost-quality tradeoffs by directing queries to the cheapest capable model. Prior work attributes routing headroom to an "unsolvability ceiling", queries no model in the pool can solve. We present a large-scale study of multi-tier LLM routing with 206,000 query-model pairs across six benchmarks (MMLU, MedQA, HumanEval, MBPP, Alpaca, ShareGPT) using the Gemma 4 and Llama 3.1 families. Evaluating with both LLM-as-a-judge and exact-match metrics, we show that a substantial portion of reported unsolvability stems from evaluation artifacts: (i) systematic judge biases favoring verbosity over correctness, (ii) truncation under fixed generation budgets, and (iii) output format mismatches. Through dual-judge validation and exact-match grounding, we reduce measured unsolvability across tasks. We introduce a decomposition framework attributing failures to these artifacts, revealing consistent patterns across domains and model families. These artifacts also distort router training signals: standard routers collapse to majority-class prediction (~79% smallest-tier optimal), confirmed via random-feature and shuffled-label controls, incurring a 13-17 percentage point opportunity cost. We provide actionable recommendations including dual-judge validation, exact-match anchoring, and cost-sensitive objectives. Our findings suggest existing routing headroom estimates are substantially inflated, underscoring the need for reliable evaluation protocols in multi-LLM systems.
comment: 12 pages, 14 tables
☆ Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry
The integration of machine learning (ML) into complex software systems has increased challenges in collaboration and communication (CoCo) of the teams building these systems. ML engineering (MLE) teams often involve diverse roles, ML engineers, data scientists, software engineers, and domain experts, each bringing unique goals, experiences, and jargon. These interdisciplinary dynamics can make it challenging to deploy, reproduce, and maintain ML-enabled systems over the long term. Previous studies have uncovered several CoCo challenges and practices, but most have focused on software-centric companies, leaving limited empirical understanding of how these dynamics unfold in hardware-centric contexts. In hardware-centric environments, CoCo challenges are shaped by additional constraints such as strict data governance, long development cycles, and tight coupling with physical processes, which amplify coordination complexity and reduce flexibility. To strengthen empirical understanding in such settings, we present a qualitative investigation of MLE teams within a global semiconductor company, where ML-enabled systems and manufacturing processes introduce additional complexity. We interviewed 12 practitioners regarding CoCo practices, tools, challenges, and approaches. Through analysis, we identified 16 recurring challenges, with unclear roles and responsibilities emerging as the most critical, and common practices and recommendations practitioners considered effective in mitigating CoCo problems. While grounded in a single organizational context, our findings align with known issues in interdisciplinary ML-enabled systems development, but also demonstrate how these challenges manifest differently under hardware-driven constraints. Our results highlight directions for future research and tool support to strengthen CoCo in MLE projects and ensure the success of ML-enabled systems.
☆ Convex Optimization with Nested Evolving Feasible Sets
Convex Optimization with Nested Evolving Feasible Sets (CONES)} is considered where the objective function $f$ remains fixed but the feasible region evolves over time as a nested sequence $S_1 \supseteq S_2 \supseteq \cdots \supseteq S_T$. The goal of an online algorithm is to simultaneously minimize the regret with respect to hindsight static optimal benchmark and the total movement cost while ensuring feasibility at all times. CONES is an optimization-oriented generalization of the well-known nested convex body chasing problem. When the loss function is convex, we propose a lazy-algorithm and show that it achieves $O(T^{1-β}), O(T^β)$ simultaneous regret and movement cost for any $β\in (0,1]$, over a time horizon of $T$. When the loss function is strongly convex or $α$-sharp, we propose an algorithm Frugal that simultaneously achieves zero regret and a movement cost of $O(\log T)$. To complement this, we show that any online algorithm with $o(T)$ regret has a movement cost of $Ω(\log{T})$ for both cases, proving optimality of Frugal.
☆ StreamPhy: Streaming Inference of High-Dimensional Physical Dynamics via State Space Models
Inferring the evolution of high-dimensional and multi-modal (e.g., spatio-temporal) physical fields from irregular sparse measurements in real time is a fundamental challenge in science and engineering. Existing approaches, including diffusion-based generative models and functional tensor methods, typically operate in offline settings, depend on full temporal observations, or incur substantial inference cost. We propose StreamPhy, an end-to-end framework that enables efficient and accurate streaming inference of full-field physical dynamics from incoming irregular sparse measurements. The framework integrates a data-adaptive observation encoder that is robust to arbitrary observation patterns, a structured state-space model that supports memory-efficient online updates across irregular time intervals, and an expressive Functional Tensor Feature-wise Linear Modulation (FT-FiLM) decoder for continuous-field generation. We prove that FT-FiLM is more expressive than the functional Tucker model, admitting a richer function class for handling complex dynamics. Experiments on three representative physical systems under challenging sampling patterns show that StreamPhy consistently outperforms state-of-the-art baselines, with at least 48\% improvement in accuracy and up to 20--100X faster inference than diffusion-based methods.
☆ Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns IEEE
Zero-shot proxies, also known as training-free metrics, are widely adopted to reduce the computational overhead in neural network evaluation for scenarios such as Neural Architecture Search (NAS), as they do not require any training. Existing zero-shot metrics have several limitations, including weak correlation with the true performance and poor generalisation across different networks or downstream tasks. For example, most of these metrics apply only to either convolutional neural networks (CNNs) or Transformers, but not both. To address these limitations, we propose Sample-Wise Activation Patterns (SWAP), and its derivative, SWAP-Score, a novel and highly effective zero-shot metric. SWAP-Score is broadly applicable across both architecture families and task domains, demonstrating strong predictive performance in the majority of tasks. This metric measures the expressivity of neural networks over a mini-batch of samples, showing a high correlation with the neural networks' ground-truth performance. For both CNNs and Transformers, the SWAP-Score outperforms existing zero-shot metrics across computer vision and natural language processing tasks. For instance, Spearman's correlation coefficient between the SWAP-Score and CIFAR-10 validation accuracy for DARTS CNNs is 0.93, and 0.71 for FlexiBERT Transformers on GLUE tasks. Moreover, SWAP-Score is label-independent, hence can be applied at the pre-training stage of language models to estimate their performance for downstream tasks. When applied to NAS, SWAP-empowered NAS, SWAP-NAS can achieve competitive performance using only approximately 6 and 9 minutes of GPU time, on CIFAR-10 and ImageNet respectively. Our code is available at: https://github.com/pym1024/SWAP_Universal
comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: text overlap with arXiv:2403.04161
☆ QuadNorm: Resolution-Robust Normalization for Neural Operators
Normalization layers in neural operators usually compute statistics by uniformly averaging discrete grid values, making the normalization itself discretization-dependent and thereby a source of transfer error across different resolutions or meshes. To enable discretization robustness, we introduce a quadrature normalization family that replaces existing uniform averaging in normalization layers with numerical quadrature: QuadNorm and BlendQuadNorm. On endpoint-inclusive uniform grids, the proposed quadrature moments are $O(h^2)$-consistent across discretizations, meaning that their cross-resolution mismatch decays quadratically with grid spacing. A transfer-error bound then predicts how normalization-induced mismatch scales with both the resolution gap and network depth. The experiments show the same gap- and depth-scaling trends predicted by the transfer-error bound. On Darcy, QuadNorm delivers the best cross-resolution performance at every tested target resolution from $64^2$ to $256^2$; on real-data benchmarks, Transolver with QuadNorm achieves nearly resolution-invariant transfer. The largest gains appear on nonperiodic PDEs and nonspectral architectures, where native-resolution improvements also emerge. We also validate BlendQuadNorm, which stays close to LayerNorm behavior and serves as a conservative default for periodic FNO settings. These results identify normalization as a previously overlooked source of resolution dependence in neural operators.
comment: 42 pages, 8 figures
☆ FlightSense: An End-to-End MLOps Platform for Real-Time Flight Delay Prediction via Rotation-Chain Propagation Features and Agentic Conversational AI
Flight delays impose cascading operational and financial burdens across the aviation network, costing the U.S. economy billions of dollars annually by disrupting interconnected aircraft rotation systems. While prior machine learning approaches have demonstrated strong predictive performance, most treat upstream delays as static input variables rather than explicitly modeling how delays propagate dynamically through aircraft rotation chains, and none have deployed such systems alongside a live weather-aware conversational AI interface for end-user interaction. This paper presents FlightSense, an end-to-end MLOps platform for real-time flight delay prediction built through a progressive three-version feature engineering framework. Version 1 trains an XGBoost classifier on 11 schedule-based features establishing a baseline ROC AUC of 0.732 on 7.07 million BTS 2018 On-Time Performance records. Version 2 introduces 11 delay propagation features derived from aircraft rotation chains via tail-number tracking, yielding the dominant performance gain (AUC 0.732 to 0.875) and surpassing the single-stage XGBoost baseline reported by Zhou (2025). Version 3 integrates five NOAA meteorological features across 10 major U.S. airports, achieving a final test set AUC of 0.879. FlightSense is deployed as a production AWS MLOps pipeline incorporating live weather ingestion via Lambda, real-time SageMaker inference, an interactive Streamlit dashboard, and an Amazon Bedrock Nova Micro conversational assistant answering natural-language delay queries via a tool-use architecture.
comment: 12 pages, 5 figures, 9 tables; machine learning, MLOps, aviation delay prediction
☆ MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82 times speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
comment: https://github.com/MuLabPKU/TransArch
☆ Mean-Pooled Cosine Similarity is Not Length-Invariant: Theory and Cross-Domain Evidence for a Length-Invariant Alternative ICML 2026
Mean-pooled cosine similarity is the default metric for comparing neural representations across languages, modalities, and tasks. We establish that this metric is not length-invariant: under the anisotropy that characterizes modern transformer representations, mean-pooled cosine grows monotonically in sequence length, independent of representational content. Empirically, on HumanEvalPack across four code LLMs, the length ratio alone explains $R^2 = 0.52$--$0.75$ of cross-language "Python proximity," while AST depth and shared-token fraction add less than 3% of explained variance beyond length. Substituting Centered Kernel Alignment (CKA) reduces explained variance by 83% and reverses the sign of the length coefficient ($β_{\mathrm{len}}: +0.86 \to -0.37$). The same pattern holds in Mistral-7B on parallel WMT pairs ($R^2 = 0.23$ EN-FR, $R^2 = 0.33$ EN-DE for cosine; $R^2 < 0.01$ for CKA). In CLIP ViT-B/32, mean-pooling reduces the length effect relative to EOS-pooling ($R^2: 0.21 \to {<}0.01$), as predicted by the theory's dependence on anisotropy. We argue that length-invariant metrics such as CKA should be the default for cross-representation comparisons, and that recent claims of cross-lingual representational convergence built on mean-pooled cosine warrant re-examination.
comment: 9 pages, 6 figures. Submitted to the Mechanistic Interpretability Workshop at ICML 2026
☆ Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate
Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named `Mage') -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B--30B), 26~hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C\# generation achieves the highest runtime-pass rate (43\% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
comment: Main Content: 10 pages, 1 figure. In total 22 pages
☆ CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement may repair code while failing to revise the assumption responsible for the discrepancy. We propose CellScientist, a dual-space hierarchical framework that couples a high-level hypothesis space with a low-level executable implementation space. CellScientist represents modeling decisions as structured states, realizes them as admissible programs under task and interface constraints, and routes execution discrepancies back to targeted hypothesis or implementation updates. This enables a closed Hypothesis -> Implementation -> Hypothesis loop where failures become structured signals for model refinement rather than debugging events. Across morphology and transcriptomic benchmarks, with additional single-cell perturbation evaluations, the final executable models selected by CellScientist improve over reference baselines under fixed split and evaluation protocols, while the workflow produces auditable refinement traces.
☆ Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning
In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the standard attention with an identity mapping. This paper provides the first theoretical understanding of ICRL without making the unrealistic linear attention simplification. In particular, we consider the standard softmax attention used in practice. We show that, with certain parameters, the layerwise forward pass of a Transformer with such softmax attention is equivalent to iterative updates of a weighted softmax temporal difference (TD) learning algorithm. Here, weighted softmax TD is a new RL algorithm that performs policy evaluation in kernel space and adopts both linear TD and tabular TD as special cases. We also prove that under a certain contraction condition, the policy evaluation error decays as the number of layers grows, with the identified parameters above. Finally, we prove that those parameters are a global minimizer of a pretraining loss, explaining their emergence in our numerical experiments.
☆ Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
☆ SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.
comment: Code will be released at https://github.com/scitix/helix
☆ Activation Differences Reveal Backdoors: A Comparison of SAE Architectures IJCNN 2026
Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.
comment: Accepted at IJCNN 2026 (IEEE WCCI). ©2026 IEEE
☆ Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation ICML 2026
Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.
comment: Accepted at ICML 2026
☆ Generative Modeling with Flux Matching
We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score-based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high-dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at https://github.com/peterpaohuang/flux_matching.
☆ Latent Order Bandits
Bandit algorithms solve diverse sequential decision-making problems, but are often too sample-inefficient for from-scratch personalization. To substantially reduce exploration times, latent bandit algorithms exploit cross-instance structure implied by discrete latent states, provided that the posterior distribution of rewards and latent states is known and accurate. However, obtaining an accurate model of this structure is difficult, and a small number of latent states may be insufficient to characterize the reward distributions in all problem instances. We propose latent order bandits (LOB), relaxing the assumptions of latent bandits to require only prior knowledge of a partial order of action preferences in each state. This allows instances of the same state to vary in reward distributions, as long as the partial order of actions is shared. For example, groups of users on a streaming service may agree on which movie genres are the best but rate experiences on different scales. We give an upper-confidence bound procedure for the LOB problem, applicable to both total and partial latent orders, and give an upper bound on its regret. To improve empirical performance, we propose a posterior-sampling algorithm and show, in a suite of experiments, that both are competitive with full-prior latent bandits when same-state instances share reward parameters, and preferable to them when reward scales differ between instances with the same latent state.
comment: arXiv admin note: text overlap with arXiv:2508.05367
Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.
☆ Spectrum-Adaptive Generalization Bounds for Trained Deep Transformers
Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
☆ Sparse Random-Feature Neural Networks with Krylov-Based SVD for Singularly Perturbed ODE
Random-feature neural networks (RFNNs), including architectures with fixed hidden layers and analytically determined output weights, offer fast training but often suffer from issues due to dense representations of the hidden layer activation. Their reliance on dense feature mappings and least squares solvers can limit scalability and numerical stability, particularly for high-dimensional or stiff systems. Specifically, the activation matrix is observed to be low-rank and extremely ill-conditioned. In this work, we propose a sparse framework for RFNNs that integrates structured sparsity into the hidden layer activations that increases the rank and employs Sparse Singular Value Decomposition (sSVD) for solving the resulting linear least squares problem scalably and efficiently while catering to the bad condition number. We explore the theory behind Lanczos-Golub-Kahan Bidiagonalization technique for sparse SVD and conduct some experiments to identify some limitations and justify the requirement for orthogonalization step in our application. Then, we demonstrate that the proposed method maintains or improves solution accuracy for solving the benchmark one-dimensional steady convection-diffusion equations case having stronger advection, while achieving substantial gains in training efficiency and robustness compared to standard dense implementations.
☆ Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint's upstream state before treating the late stack as self-contained.
☆ The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass
Final outputs hide when a checkpoint commits to its next-token prediction. We introduce the convergence gap, a model-diffing diagnostic that decodes each layer's next-token distribution and measures its distance to the model's own final distribution. Across six paired pretrained and instruction-tuned checkpoints in native prompting regimes, instruction-tuned checkpoints remain farther from their final predictions later into the stack. The effect persists under endpoint-matched raw and tuned readouts, endpoint-free same-history checks, and fixed-history template replay. Matched-prefix interventions identify late MLP windows as the largest tested leverage point: late IT grafts into PT hosts increase late KL by +0.34 nats, while PT-late swaps into IT hosts reduce it by -0.51 nats; matched random late perturbations give only +0.003 versus +0.327 for the true late graft. A preselected Gemma case study provides behavior-facing plausibility for the same late swap, without serving as a benchmark claim. These results identify a robust predictiondynamics signature of post-training: released instruction-following checkpoints tend to settle later, and late MLP computation is the strongest tested bidirectional handle on that delay under matched histories.
☆ Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention
Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose $\textbf{Mask2Cause}$, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
☆ Predictive but Not Plannable: RC-aux for Latent World Models
A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.
☆ Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics
Many scientific and combinatorial problems admit multiple correct solutions, not a single label. Standard supervised learning resolves this ambiguity by choosing one solution as the target, but this hidden selector can be arbitrary, discontinuous, and harder to learn than the underlying solution set. We study bifurcation models, a weight-tied dynamical view in which different initializations can converge to different stable equilibria, so the model represents an attractor landscape rather than one chosen branch. We prove that broad set-valued maps with locally Lipschitz branches can be represented by regular equilibrium dynamics and that the induced selectors are almost everywhere regular, while manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models show that such dynamics can discover multiple valid equilibria without branch labels and outperform single-branch supervision. Allen--Cahn experiments further show that diversity is not automatic: it can be encouraged explicitly, but with an accuracy--diversity tradeoff.
☆ Structured Role-Aware Policy Optimization for Multimodal Reasoning
Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.
comment: 32 pages
☆ bispectrum: Selective $G$-Bispectra Made Practical
Many machine learning tasks are invariant under the action of a group $G$ of transformations: signal classification can be invariant under translations, image classification under 2D rotations, and spherical-image classification under 3D rotations. The $G$-bispectrum is a principled complete invariant of a signal (retaining all all signal's information up to the group action) with proven benefits in machine learning and as a pooling layer in deep networks. However, its deployment has been hampered by high computational cost and a patchwork of group-specific implementations. We present bispectrum, an open-source, fully unit-tested PyTorch library that implements selective $G$-bispectra for seven different group actions, as differentiable modules that can be directly incorporated into machine learning pipelines and deep learning architectures. For finite groups $G$, selectivity reduces the computational cost from $O(|G|^2)$ to $O(|G|)$. For planar rotations, we leverage the disk bispectrum. For spherical 3D rotations, we introduce an augmented selective bispectrum at band-limit $L$ which reduces the cost from $O(L^3)$ to $Θ(L^2)$ coefficients. We profile the entire library (for which we implemented various compute optimizations), showing that it delivers near-exact $G$-invariance with its selective $G$-bispectra computed in sub-millisecond time on GPU (up to commonly used bandlimits). We evaluate the benefits of incorporating $G$-bispectra as pooling layers into deep learning architectures on three classical benchmark datasets --comparing against norm pooling, gated pooling, Fourier-ELU pooling, max pooling, and (non-equivariant) data-augmented convolutional baselines. Results show that $G$-bispectra consistently outperform alternatives in the low-data, moderate-capacity regime.
☆ MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA(Yi et al., 2023) templates spanning five task families -- email, table, QA, abstract, and code-comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID's multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla
☆ PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning
Personalized healthcare decisions require reasoning about how physiological and behavioral variables influence an individual patient over time. Existing temporal causal discovery methods are poorly matched to this setting: cohort-level models provide stable but non-personalized structures, while per-patient discovery is unreliable because individual trajectories are short, noisy, irregular, and non-stationary. This creates a fundamental gap between population-level causal modeling and the patient-specific, time-varying mechanisms needed for intervention reasoning. We introduce PerCaM-Health, a framework for learning personalized dynamic causal graphs from longitudinal health data. The framework learns a knowledge-guided population temporal graph, then conservatively adapts and evolves it using patient-specific temporal evidence and rolling-window updates, producing interpretable and auditable graph sequences. By coupling these graphs with temporal structural equations, the framework enables patient-level counterfactual queries, such as estimating short-horizon outcome changes under hypothetical behavioral interventions. Experiments on a semi-synthetic dynamic health benchmark show that PerCaM-Health improves graph recovery, dynamic edge tracking, and intervention direction accuracy compared to cohort-level, per-patient, and non-personalized temporal baselines. These results demonstrate that jointly modeling personalization and temporal evolution yields more reliable causal structure and intervention reasoning.
☆ How Big Should a Wireless Foundation Model Be?
Wireless foundation models are rapidly emerging as a key enabler of AI-native communication systems, yet a fundamental question remains unanswered: how large should these models be? We present a principled, physics-grounded answer, showing that the intrinsic dimensionality (dNL, the nonlinear manifold dimension of the channel) acts as the fundamental bottleneck, defining the scaling ceiling once a data-sufficient regime is reached. This dimensionality is not a design choice but a physical constraint: Maxwell's equations, finite scatterers, and antenna aperture inherently constrain wireless propagation environments to a limited number of degrees of freedom -- spanning 5-35 across both real-world OTA measurements and 3GPP-standardized channel models we evaluate -- orders of magnitude below the ~1,000-dimensional semantic space of language. As a consequence, we propose a scaling framework for wireless AI: taking NTN satellite channels as a representative case (dNL ~= 14), scaling gains diminish rapidly beyond ~30 million parameters, entering a stochastic asymptote above 70M where a further 1.6x increase (96M->150M) yields only 0.52 dB. Beyond this ceiling, inference-time adaptation via pilot-aided test-time training (TTT) is far more effective: a compact 12M-parameter model surpasses a static 96M model by 9.9 dB (NMSE, SNR = 20 dB) / 7.6 dB (MCM, SNR = 10 dB) at one-eighth the parameters. With dNL distributions validated across real-world indoor massive MIMO measurements, our scaling laws and TTT gains are demonstrated through NTN satellite simulations, reframing wireless AI design: channel geometry -- not model size -- fundamentally governs the scaling laws of physical-layer wireless AI.
☆ Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
Over-the-air federated learning (OTA-FL) reduces uplink latency by exploiting waveform superposition, but conventional analog aggregation schemes typically require instantaneous channel state information (CSI), channel inversion, and coherent phase alignment, which can be difficult to maintain in practical wireless systems. This paper proposes resource-element energy difference (REED), a noncoherent aggregation primitive for continuous signed updates that avoids instantaneous CSI. REED maps the positive and negative parts of each real-valued update to transmit energies on two orthogonal resource elements with independent phase dithers, and the server estimates the signed aggregate from their energy difference. With only slow-timescale calibration of average channel powers, REED is unbiased for the desired signed sum and admits an exact closed-form variance under Rayleigh fading. We incorporate REED into full-participation FedAvg and prove a smooth nonconvex stationarity bound. Under an average per-client energy budget, the aggregation gain can be scheduled so that the REED-induced perturbation scales quadratically with the local stepsize, yielding the canonical (1/sqrt(T)) stationarity rate. Experiments on MNIST and Fashion-MNIST demonstrate that REED closely matches clean FedAvg and coherent CSIT aggregation in IID settings, while maintaining stable convergence with a moderate performance degradation under strong data heterogeneity.
comment: Preprint; Under-review; Codes to replicate the results is available at: https://github.com/zavareh1/REED
☆ When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
♻ ☆ Privately Estimating Black-Box Statistics
Standard techniques for differentially private estimation, such as Laplace or Gaussian noise addition, require guaranteed bounds on the sensitivity of the estimator in question. But such sensitivity bounds are often large or simply unknown. Thus we seek differentially private methods that can be applied to arbitrary black-box functions. A handful of such techniques exist, but all are either inefficient in their use of data or require evaluating the function on exponentially many inputs. In this work we present a scheme that trades off between statistical efficiency (i.e., how much data is needed) and oracle efficiency (i.e., the number of evaluations). We also present lower bounds showing the near-optimality of our scheme.
♻ ☆ DReS: Dual Reconstruction Smoothing for Functional Regularization
Smoothness is a key inductive bias in machine learning and is closely related to generalization. Existing smoothness-inducing methods typically rely either on explicit gradient regularization, which often incurs substantial computational and memory overhead, or on data-mixing strategies, which are less naturally applicable to unsupervised and self-supervised settings. In this work, we propose $\textit{Dual Reconstruction Smoothing}$ (DReS), a nonparametric regularization framework that induces smoothness through a spline-based auxiliary branch with shared model parameters. The method introduces no additional trainable parameters and can be applied to arbitrary submodules, making it suitable for unsupervised, self-supervised, and supervised regimes. We show theoretically that the discrepancy between the target function and its DReS approximation is controlled by higher-order smoothness quantities of the function, establishing the method as an implicit higher-order smoothness regularizer. Empirically, DReS improves representation learning across several self-supervised methods, improves generation quality in generative modeling, and achieves strong performance relative to competitive baselines in supervised learning.
♻ ☆ Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts ICML 2026
Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable {C}ontinual Le{a}rner with efficient Bi-Level {R}outing Mixture-of-{E}xperts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging dataset, OmniBenchmark-1K, for CIL performance evaluation on very long task sequences with hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. We hope that this work will inspire further research into continual learning over extremely long task sequences. Code and dataset are publicly released at https://github.com/LMMMEng/CaRE.
comment: Accepted by ICML 2026
♻ ☆ Synergistic Benefits of Joint Molecule Generation and Property Prediction
Modeling the joint distribution of data samples and their properties allows to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial~peptides.
comment: 17 pages, 4 figures
♻ ☆ AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning dilemma in chunk-wise training: using ground-truth references causes train-inference mismatch, while self-generated references entangle supervision with identity drift. Our asymmetric design circumvents this by anchoring the teacher model with ground-truth continuity references to provide drift-free, chunk-level supervision, thereby avoiding the teacher bottleneck. Meanwhile, the student model learns under inference-aligned conditions, conditioned only on self-generated references, and is trained via distribution matching to preserve identity over long horizons. Extensive experiments show AsymTalker achieves state-of-the-art results on HDTF and VFHQ. It guarantees high-fidelity, identity-consistent synthesis over 600-second videos and reaches a real-time inference speed of 66 FPS.
♻ ☆ On the Meta-Design of Allocation Problems
There is an extensive literature that studies how to find optimal policies in resource allocation problems, taking the underlying design parameters that define the allocation, such as what data is collected, how many people can be served, and quality of service as fixed constraints. Yet, from a planner's perspective, these design parameters are themselves optimization variables that are just as important in determining overall welfare as selecting the optimal targeting rule for a given set of constraints. This realization motivates a rich set of meta-design questions exploring how planners should make principled decisions about investments in prediction, capacity constraints, and treatment quality, all of which lie upstream of classical policy optimization. Building on initial theoretical work in this space, our paper has three main contributions. First, we formally define the broad meta-design space of resource allocation problems. Second, we develop empirical tools that enable practitioners to reliably navigate it. Third, we demonstrate the framework in two real-world case studies on German employment services and targeted cash transfer programs in Ethiopia.
♻ ☆ Multi-environment Invariance Learning with Missing Data
Learning models that can handle distribution shifts is a key challenge in domain generalization. Invariance learning, an approach that focuses on identifying features invariant across environments, improves model generalization by capturing stable relationships, which may represent causal effects when the data distribution is encoded within a structural equation model (SEM) and satisfies modularity conditions. This has led to a growing body of work that builds on invariance learning, leveraging the inherent heterogeneity across environments to develop methods that provide causal explanations while enhancing robust prediction. However, in many practical scenarios, obtaining complete outcome data from each environment is challenging due to the high cost or complexity of data collection. This limitation in available data hinders the development of models that fully leverage environmental heterogeneity, making it crucial to address missing outcomes to improve both causal insights and robust prediction. In this work, we derive an estimator from the invariance objective under missing outcomes. We establish non-asymptotic guarantees on variable selection property and $\ell_2$ error convergence rates, which are influenced by the proportion of missing data and the quality of imputation models across environments. We evaluate the performance of the new estimator through extensive simulations and demonstrate its application using the UCI Bike Sharing dataset to predict the count of bike rentals. The results show that despite relying on a biased imputation model, the estimator is efficient and achieves lower prediction error, provided the bias is within a reasonable range.
comment: Added co-author
♻ ☆ Upper Generalization Bounds for Neural Oscillators
Neural oscillators that originate from second-order ordinary differential equations (ODEs) have shown competitive performance in learning mappings between dynamic loads and responses of complex nonlinear structural systems. Despite this empirical success, theoretically quantifying the generalization capacities of their neural network architectures remains undeveloped. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper probably approximately correct (PAC) generalization bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating the uniformly asymptotically incrementally stable second-order dynamical systems are derived by leveraging the Rademacher complexity framework. These bounds are further extended to the squared Wasserstein-1 distances between the probability measures of quantities of interest calculated from target causal operators and the corresponding learned neural oscillators. The theoretical results show that the estimation errors grow polynomially with respect to both MLP sizes and the time length, thereby avoiding the curse of parametric complexity. Furthermore, the derived error bounds demonstrate that constraining the Lipschitz constants of the MLPs via loss function regularization can improve the generalization ability of the neural oscillator. Numerical studies considering a Bouc-Wen nonlinear system under stochastic seismic excitation validates the theoretically predicted power laws of the estimation errors with respect to the sample size and time length, and confirms the effectiveness of constraining MLPs' matrix and vector norms in enhancing the performance of the neural oscillator under limited training data.
comment: This manuscript contains 33 pages with 6 figures
♻ ☆ A Theory of Saddle Escape in Deep Nonlinear Networks
In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law $τ_\star = Θ(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than the total depth $L$. We find that this same $r-2$ exponent is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.
♻ ☆ Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
Setting the learning rate (LR) for a deep learning model is a critical part of successful training. Choosing LRs is often done empirically with trial and error. In this work, we explore a solvable model of optimal LR schedules for a powerlaw random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $η_T^\star(t)$ where $t$ is the current iterate and $T$ is the training horizon. This schedule is computed both as a numerical optimization problem and also analytically using optimal control theory. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay $η_T^\star(t) \simeq T^{-ξ} (1-t/T)^δ$ where $ξ$ and $δ$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant initial LR and annealing performed over a vanishing fraction of training steps. We investigate joint optimization of LR and batch size and find batch ramps can improve the wall-clock time in the easy phase. Beyond SGD, we derive optimal schedules for momentum parameter $β(t)$ and show that it improves the loss-scaling exponent in the hard phase. We compare our optimal schedule to various benchmarks including (1) optimal constant learning rates $η_T(t) \sim T^{-ξ}$ (2) optimal power laws $η_T(t) \sim T^{-ξ} t^{-χ}$, finding that our schedule achieves better rates than either of these. Our theory suggests that LR transfer across training horizon depends on the structure of the model and task. For ResNet image classification on CIFAR-5M, the learning curves exhibit hard-phase behavior where optimal base LRs are constant under sufficient annealing. GPT-2 style transformers trained in language modeling exhibit easy-phase behavior where optimal LRs shift even under annealing.
♻ ☆ TAP: Two-Stage Adaptive Personalization of Multi-Task and Multi-Modal Foundation Models in Federated Learning
In federated learning (FL), local personalization of models has received significant attention, yet personalized fine-tuning of foundation models remains underexplored. In particular, there is a lack of understanding in the literature on how to personalize foundation models in settings where there exist heterogeneity not only in data, but also in tasks and modalities across the clients. To address this gap, we propose Two-Stage Adaptive Personalization (TAP). In the first stage, TAP leverages mismatched model architectures between clients and the server to selectively replace personalized parameters with global updates, explicitly limiting cross-task and cross-modality interference. In the second stage, TAP conducts post-FL distillation on the global model to recover a beneficial shared structure. By reintroducing generalizable knowledge only after the global model has stabilized, TAP enhances generalization without compromising personalization. In developing our methodology, we introduce the first convergence analysis of federated foundation model training at the server under modality-task pair heterogeneity across clients, and demonstrate the impact of the number of modality-task pairs on model fine-tuning. Through extensive experiments, we demonstrate the effectiveness of TAP across a variety of datasets and tasks in comparison to state-of-the-art baselines. The implementation code is publicly available at https://github.com/lee3296/TAP.
comment: 29 pages
♻ ☆ Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs
Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To alleviate the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates layer-wise learning signals to provide auxiliary supervision, which enhances the convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, suggesting that intermediate learning signals can extend the practical applicability of EP to deeper architectures.
♻ ☆ Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Agentic reinforcement learning (RL) for software engineering spends much of its compute on stateful trajectories whose grouped binary rewards are highly skewed and weakly contrastive. We frame this as pass-rate control and show that the binary reward-side signal is strongest near a 50% rollout pass rate under four criteria: reward entropy, group-filtering survival, leave-one-out (RLOO) advantage energy under Group Relative Policy Optimization (GRPO), and success-failure pair count. We propose Prefix Sampling (PS), which replays self-generated trajectory prefixes to steer skewed groups toward this regime: successful prefixes give mostly failing groups a head start, while failing prefixes handicap mostly passing groups. Replayed states are reconstructed through the existing rollout path, and replayed tokens are masked from the loss so optimization applies only to current-policy continuations. On SWE-bench Verified, PS reaches the baseline high-score regime within evaluation variability while delivering 2.01x and 1.55x end-to-end wall-clock speedups on Qwen3-14B and Qwen3-32B; the 14B peak improves from 0.274 to 0.295. AIME 2025 experiments on 4B and 8B show the same pass-rate-control pattern, and 4B ablations attribute gains to replay, bidirectional coverage, and adaptive control.
comment: 25 pages, 8 figures, 11 tables; revised manuscript text and formatting
♻ ☆ Frequency-Aware Model Parameter Explorer: A new attribution method for improving explainability
State-of-the-art attribution methods rely on adversarial sample generation that applies an all-pass filter across the frequency spectrum, discarding fine-grained high-frequency information that is demonstrably important for accurate feature attribution in deep neural networks. By generating adversarial samples that selectively perturb high- and low-frequency components, we can probe which spectral features a model relies on most -- directly translating frequency-domain exploration into attribution signals. Building on this insight, we propose FAMPE (Frequency-Aware Model Parameter Explorer), a novel attribution method that introduces an FFT-based α-weighted perturbation scheme -- separately modulating high- and low-frequency components via an energy-driven spectral cutoff -- and, crucially, integrates this frequency-aware exploration directly into model parameter exploration for attribution, a connection that has not been established in prior work. Unlike prior frequency-aware adversarial approaches that target transferability or imperceptibility, FAMPE's specific formulation is designed and validated exclusively for explainability, translating spectral structure into fine-grained attribution maps without requiring any manual baseline selection. Evaluated on ImageNet across four architectures spanning CNNs and Vision Transformers, at fixed α= 0.1 FAMPE outperforms AttEXplore by 4.25% on Inception-v3 and 12.04% on MaxViT-T, with per-sample oracle selection further revealing that low-frequency-dominated images systematically benefit from high-frequency perturbations -- underscoring the potential of adaptive spectral exploration. Our ablation studies confirm that high-frequency perturbations are disproportionately responsible for attribution precision, while excessive low-frequency noise degrades global structural coherence.
comment: Preprint
♻ ☆ Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids
Fault location in distribution grids is critical for reliability and minimizing outage durations. Yet, it remains challenging due to partial observability, given sparse measurement infrastructure. Recent works show promising results by combining Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) for spatio-temporal learning. Still, many modern GNN architectures remain untested for this grid application, while existing GNN solutions have not explored GNN topology definitions beyond simply adopting the full grid topology to construct the GNN graph. We address these gaps by (i) systematically comparing a newly proposed graph-forming strategy (measured-only) to the traditional full-topology approach, and (ii) introducing STGNN (Spatio-temporal GNN) models based on GraphSAGE and an improved Graph Attention (GATv2), for distribution grid fault location; (iii) benchmarking them against state-of-the-art STGNN and RNN baselines on the IEEE 123-bus feeder. In our experiments, all evaluated STGNN variants achieve high performance and consistently outperform a pure RNN baseline, with improvements up to 11 percentage points F1. Among STGNN models, the newly explored RGATv2 and RGSAGE achieve only marginally higher F1 scores. Still, STGNNs demonstrate superior stability, with tight confidence intervals (within +/- 1.4%) compared to the RNN baseline (up to +/- 7.5%) across different experiment runs. Finally, our proposed reduced GNN topology (measured-only) shows clear benefits in both (i) model training time (6-fold reduction) and (ii) model performance (up to 11 points F1). This suggests that measured-only graphs offer a more practical, efficient, and robust framework for partially observable distribution grids.
♻ ☆ PAIR-Former: Budgeted Relational Multi-Instance Learning for Functional miRNA Target Prediction
Functional miRNA--mRNA targeting is a large-bag prediction problem where each transcript yields a heavy-tailed pool of candidate target sites (CTSs), yet only a pair-level label is observed. Prior methods use max-pooling over individual CTS scores, ignoring relational patterns among sites, but modeling these patterns is critical for accuracy. The challenge is that naive relational aggregation incurs $\mathcal{O}(n^2)$ cost, prohibitive when $n$ reaches thousands, yet a cheap scan alone discards the very interactions that drive functional repression. We formalize this tension as \emph{Budgeted Relational Multi-Instance Learning (BR-MIL)}, a new MIL problem where the compute budget $K$ is a first-class constraint such that at most $K$ instances per bag may receive expensive encoding and relational processing. We establish theoretical foundations for BR-MIL, proving that both approximation quality and generalization are governed by $K$ rather than the raw bag size $n$. Building on this theory, we propose \textbf{PAIR-Former}, which scans all candidates cheaply, selects $K$ diverse CTSs, and aggregates them via Set Transformer. PAIR-Former achieves state-of-the-art performance, outperforming all reproduced baselines with F1$=0.840$ on miRAW (10-fold balanced CV) and $0.839$ on deepTargetPro in transfer evaluation, while achieving $0.793$ on the large-scale MTI benchmark (420K pairs, $38\times$ larger), demonstrating that budgeted relational MIL scales where naive approaches fail. Additional results on CAMELYON16 and Musk2 further show that the proposed BR-MIL formulation extends beyond biological sequence modeling.
comment: Preprint. Under review. During the preprint stage, inquiries and feedback can be directed to Jiaqi Yin (yjqhit@gmail.com)
♻ ☆ Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics
In this paper, we introduce Exact Flow Linear Attention~(EFLA), an exact-flow formulation of delta-rule linear attention. We show that the delta-rule update can be interpreted as an explicit Euler discretization of an underlying continuous-time system. EFLA replaces this first-order update with the exact closed-form flow. By exploiting the rank-1 structure of the dynamics matrix, both the matrix exponential and the input integral collapse to a simple update that preserves delta-rule linear attention's algebraic structure, parameter count, linear-time complexity, and chunkwise parallelism. This attention mechanism removes the Euler discretization error of the delta-rule dynamics without introducing additional parameters. Experiments on robustness tests, language modeling benchmarks, and the MAD synthetic benchmark show that EFLA improves stability under corrupted and high-energy inputs, reduces perplexity, and achieves stronger downstream performance compared to SSM and Euler-style baselines. These results establish exact-flow integration as a principled and scalable update mechanism for delta-rule linear attention.
comment: 16 pages, 5 figures
♻ ☆ Hammer and Anvil: Toward a Theory of Backdoors in Federated Learning
Federated Learning (FL) enables distributed model training but is vulnerable to backdoor attacks, where malicious clients embed attacker-controlled behaviors into the global model. Existing defenses fail against adaptive adversaries. In this paper, we present "Hammer and Anvil", a principled theoretical framework that categorizes backdoors by the deviation, $δ$, of their updates to the mean of the updates. We identify two fundamental defense types: "Type 1 (The Anvil)", comprising outlier detection and robust aggregation effective against large-deviation attacks, and "Type 2 (The Hammer)", consisting of removal-based defenses effective against small-deviation attacks. We demonstrate that defenses of a single type and non-principled combined defenses inherently leave an exploitable gap for adaptive attackers. To bridge this gap, we propose the principled combination of Type 1 and Type 2 defenses. We evaluate our framework against a new, worst-case, full-information adaptive adversary that knows the benign updates, the aggregation algorithm, and its parameters, and yet this adversary fails against our combined defenses. Our empirical evaluation across various datasets and settings shows that single-typed and non-principled combined defenses are easily broken, often by a single malicious client. In contrast, our best combined defense variants, $HA_{Flame}^{CSFT}$, $HA_{Krum}^{CSFT}$, and $HA_{Multi-Metrics}^{CSFT}$, remain undefeated even in the most adversarial settings. Our results provide a principled approach for research on backdoors in federated learning.
♻ ☆ Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise
While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
comment: Code is available at https://github.com/Dingzhen230/Heavy-tailed-Noise-in-LLMs
♻ ☆ End-to-end PDDL Planning with Hardcoded and Dynamic Agents
We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. We support two categories of agents: hardcoded, which are informed by logs and error traces and have a pre-defined goal (e.g., fix issues with PDDL syntax, check temporal constraints), and dynamic, which have no predefined goal but adapt to the specific domain and revise the latent planning abstraction. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework on GPT-\{4o, 5-mini, 5.4\}, and Gemini-\{2.5, 3\}-flash across more than ten domains and tasks, including the Google NaturalPlan benchmark, Planbench, and classic planning problems like Sokoban, Blocksworld and the Tower of Hanoi, where LLMs are known to struggle even with small instances. Our framework can be integrated with any PDDL planning engine and validator (we successfully tested Fast Downward, LPG, POPF, VAL, and uVAL) and represents a significant step toward end-to-end planning aided by LLMs.
comment: Code: https://github.com/EmanueleLM/MultiAgentPlanning
♻ ☆ Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation
The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency; aligning to human behavioral models when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions which reveals an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research.
comment: 25 pages, 9 figures, 7 tables
♻ ☆ Drifting Fields are not Conservative
Drifting models have recently gained attention for generating high-quality samples in a single forward pass. During training, they learn a push-forward map by following a vector-valued field, the drift field. We ask whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative and cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism, with the Gaussian kernel as the unique radial exception. Guided by this, we introduce the sharp kernel $k^\#$ and a sharp-normalized drift field that is conservative for general radial kernels. The resulting vector field is the gradient of a scalar potential that can be optimized directly using stochastic gradient descent. Moreover, the field has the form of a score difference of kernel density estimates, and gives exact equilibrium identifiability. Thus, sharp normalization closes the gap to related literature, such as Wasserstein gradient-flows and denoising score matching, also for non-Gaussian kernels. Empirically, sharp normalization preserves the performance of the original drifting objective, suggesting that the non-conservative flexibility is not required for high-quality generation.
♻ ☆ Rep2Text: Decoding Full Text from a Single LLM Token Representation
Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we investigate a fundamental question: to what extent can the original input text be recovered from a single last-token representation in an LLM? To this end, we propose Rep2Text, a novel framework for decoding text from last-token representations. Rep2Text employs a trainable adapter that maps a target model's last-token representation into the token embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments across various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, etc.) show that, on average, roughly half of the tokens in 16-token sequences can be recovered from this compressed representation while preserving strong semantic coherence. Further analysis reveals a clear information bottleneck effect: as sequence length increases, token-level recovery declines, while semantic information remains relatively well preserved. We also find that scaling effects are less pronounced in inversion tasks. Finally, our framework demonstrates robust generalization to out-of-distribution clinical data.
comment: 18 pages, 6 figures, 6 tables
♻ ☆ Time series causal discovery with variable lags
Causal Bayesian Networks (CBNs) are a powerful tool for reasoning under uncertainty about complex real-world problems. Such problems evolve over time, responding to external shocks as they occur. To support decision-making, CBNs require a cause-and-effect map of the variables under consideration, known as the network's structure. Learning the graphical structure of a causal model from data remains challenging; learning it from time-series data is even harder because dependencies may arise at different time lags. Existing time-series causal discovery methods often assume a fixed lag window and do not explicitly optimise edge-specific lags. We propose a Tabu-based structure learning algorithm that searches for a time-ordered directed structure (i.e., where every edge respects time) while allowing edge-specific lags up to a specified maximum lag. The approach uses a decomposable BIC-based score with node-specific effective sample sizes and an explicit lag-length penalty encouraging parsimonious delay assignments while preserving efficient local score updates. We provide theoretical guarantees of validity and local optimality, and we also describe a parallel implementation for improved scalability. In simulations, the method recovered graph structure competitively and estimated lags accurately when true adjacencies were recovered. On a real-world UK COVID-19 policy dataset, the learnt structure was dominated by short delays while retaining a substantial minority of longer-lag dependencies, consistent with delayed behavioural and epidemiological effects.
♻ ☆ Scalable Option Learning in High-Throughput Environments
Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.
♻ ☆ Discovering Sparse Counterfactual Factors via Latent Adjustment for Survey-based Community Intervention
Transportation surveys are widely used to understand travel preferences and adoption barriers, yet most survey-based analyses remain descriptive or predictive and rarely provide sparse, policy-feasible intervention strategies. We study sparse counterfactual community intervention from survey responses, where the goal is to shift a target respondent group toward a desired reference group through controllable survey-variable adjustments. We formulate this task as a policy-feasible distributional alignment problem using a fixed-basis nonnegative latent representation that preserves pre/post comparability and provides a stable map from latent factors to original variables. To make latent movement actionable, target-relevant latent factors are identified through Shapley-guided attribution and transferred to controllable variables as intervention priorities. Feasible group-level adjustments are then learned by minimizing an entropy-regularized optimal-transport discrepancy between the post-intervention target distribution and the reference distribution, together with a weighted $\ell_{2,1}$ penalty that promotes shared policy-lever sparsity. Experiments on real-world transportation survey datasets show that the proposed framework produces compact and interpretable policy-feasible interventions with explicit adjustment magnitudes, improves population-level conversion, and preserves intervention sparsity. Code and datasets are publicly available at: https://github.com/pangjunbiao/latent-group-alignment.git
♻ ☆ Flock: A Knowledge Graph Foundation Model via Learning on Random Walks
We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize to novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, which enables them to learn structural properties of nodes and relations that transfer to novel KGs with similar structure. However, the conventional notion of deterministic equivariance inherently limits the expressive power of KGFMs, as it prevents them from distinguishing relations that are structurally similar but semantically distinct. To overcome this limitation, we propose to leverage probabilistic node-relation equivariance, which preserves equivariance in distribution while using structured randomness to break symmetries at inference time. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences, embeds them with a sequence model, and aggregates node and relation representations through learned pooling. Flock respects probabilistic node-relation equivariance and, crucially, is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals on which current KGFMs fail, and achieves state-of-the-art performance on entity and relation prediction tasks across 54 KGs from diverse domains. Code is available at https://github.com/jw9730/flock.
comment: 42 pages, 7 figures
♻ ☆ HYPER: A Foundation Model for Inductive Link Prediction with Knowledge Hypergraphs
Inductive link prediction with knowledge hypergraphs is the task of predicting missing hyperedges involving completely novel entities (i.e., nodes unseen during training). Existing methods for inductive link prediction with knowledge hypergraphs assume a fixed relational vocabulary and, as a result, cannot generalize to knowledge hypergraphs with novel relation types (i.e., relations unseen during training). Inspired by knowledge graph foundation models, we propose HYPER as a foundation model for link prediction, which can generalize to any knowledge hypergraph, including novel entities and novel relations. Importantly, HYPER can learn and transfer across different relation types of varying arities, by encoding the entities of each hyperedge along with their respective positions in the hyperedge. To evaluate HYPER, we construct 16 new inductive datasets from existing knowledge hypergraphs, covering a diverse range of relation types of varying arities. Empirically, HYPER consistently outperforms all existing methods in both node-only and node-and-relation inductive settings, showing strong generalization to unseen, higher-arity relational structures.
♻ ☆ Partially Lazy Gradient Descent for Smoothed Online Learning
We introduce \textsc{$k$-lazyGD}, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k{=}1$) and lazy GD/dual-averaging (for $k{=}T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that \textsc{$k$-lazyGD} achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T{+}1)T})$ for any laziness slack $k$ up to $Θ(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that \textsc{$k$-lazyGD} can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.
♻ ☆ AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks
Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
♻ ☆ Sparse Attention as Compact Kernel Regression
Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $α$-entmax attention with $α= 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
comment: 16 pages, 5 figures
♻ ☆ Flat Channels to Infinity in Neural Loss Landscapes NeurIPS'25
The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm$infinity, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_iσ(\mathbf{w_i} \cdot \mathbf{x}) + a_jσ(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow σ(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) σ'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or ADAM, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
comment: Accepted to NeurIPS'25 (fixed resolution of equations in figs.1,2,3)
♻ ☆ Online Localized Conformal Prediction
Conformal prediction is a framework that provides valid uncertainty quantification for general models with exchangeable data. However, in the online learning and time-series settings, exchangeability is not satisfied. Existing online conformal methods, such as adaptive conformal inference (ACI), can achieve long-run validity, yet they remain inefficient under covariate heterogeneity because they rely on global calibration. We propose \emph{Online Localized Conformal Prediction (OLCP)}, which combines online adaptation with covariate-dependent localization to better reflect heterogeneity. To reduce sensitivity to the localization bandwidth, we further develop \emph{OLCP-Hedge}, which performs bandwidth selection as an online expert aggregation problem using a constrained online convex optimization framework. Importantly, we provide coverage guarantees for both algorithms and demonstrate through simulations and real-data experiments that the proposed methods attain valid long-run coverage with narrower prediction sets than existing baselines.
♻ ☆ Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose \textbf{LoPT}: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
comment: 33pages
♻ ☆ VDCook:DIY video data cook your MLLMs
We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data `cooking' and indexing\cite{vlogger}. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. \textbf{Project demo:} https://screenapp.io/app/v/WP0SvffgsH
♻ ☆ LUCAS-MEGA: A Large-Scale Multimodal Dataset for Representation Learning in Soil-Environment Systems
Understanding soil is fundamental to agriculture, carbon cycling, and environmental sustainability, yet progress is limited by fragmented and heterogeneous datasets that constrain modeling to small-scale predictive settings rather than high-dimensional representation learning. We introduce LUCAS-MEGA, a large-scale multimodal dataset constructed through systematic data fusion of European soil-environment observations, with the LUCAS survey as its backbone. The fused dataset comprises over 70,000 samples and more than 1,000 features spanning physical, chemical, environmental, biological, and visual attributes, aggregated from 68 source datasets. To enable integration at scale, we develop SoilFuser, a multi-agent, human-in-the-loop data fusion pipeline that standardizes heterogeneous data formats and measurement protocols, resolves inconsistencies and invalid entries (e.g., unit inconsistencies, codebook mismatches, and erroneous values), incorporates natural language annotations, and harmonizes multimodal attributes and metadata into a unified, machine learning-ready feature space. The resulting dataset captures key characteristics of real-world soil observations, including multimodality, uneven feature coverage, and heterogeneous uncertainty. To demonstrate the usability of LUCAS-MEGA for data-driven modeling, we pretrain a multimodal tabular transformer (SoilFormer) using a self-supervised objective based on feature masking, achieving stable training, strong predictive performance, and representations that support uncertainty-aware prediction. We further show that the learned representations recover relationships consistent with established soil processes. LUCAS-MEGA is released with open access and is accompanied by composable, agent-friendly APIs that support structured querying and data-driven workflows.
comment: 27 pages, 7 figures, 1 table
♻ ☆ Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization ACL 2026
Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.
comment: Accepted to ACL 2026. 23 pages, 29 figures, 6 tables
♻ ☆ DeepFedNAS: Efficient Hardware-Aware Architecture Adaptation for Heterogeneous IoT Federations via Pareto-Guided Supernet Training
Deploying federated learning across heterogeneous IoT device fleets requires tailored neural network architectures for each device class, yet existing Federated Neural Architecture Search (FedNAS) methods suffer from unguided supernet training and prohibitively costly post-training search pipelines that demand over 20 GPU-hours per deployment target. We introduce DeepFedNAS, a two-phase framework built on a multi-objective fitness function that synthesizes information-theoretic network metrics with architectural heuristics. In the first phase, Federated Pareto Optimal Supernet Training replaces random subnet sampling with a pre-computed cache of elite, high-fitness architectures, yielding a superior supernet. In the second phase, a Predictor-Free Search uses this fitness function as a zero-cost accuracy proxy, discovering hardware-optimized subnets in ~20 seconds, a ~61x speedup over the baseline pipeline. Experiments on CIFAR-10, CIFAR-100, and CINIC-10 demonstrate state-of-the-art accuracy (up to +1.21% on CIFAR-100), a 2.8x reduction in per-round transmission size, and robust performance under extreme non-IID conditions (α = 0.1), making DeepFedNAS practical for scalable, communication-constrained IoT federations. Source code: https://github.com/bostankhan6/DeepFedNAS
comment: This paper significantly extends the preliminary work presented at ESANN 2026. Source Code: https://github.com/bostankhan6/DeepFedNAS
♻ ☆ Separation Assurance between Heterogeneous Fleets of Small Unmanned Aerial Systems via Multi-Agent Reinforcement Learning
In the envisioned future dense urban airspace, multiple companies will operate heterogeneous fleets of small unmanned aerial systems (sUASs), where each fleet includes several homogeneous aircraft with identical policies and configurations, e.g., equipage, sensing, and communication ranges, making tactical deconfliction highly complex for the aircraft. This paper aims to address two core questions: (1) Can tactical deconfliction policies converge or reach an equilibrium to ensure a conflict-free airspace when companies operate heterogeneous fleets of homogeneous aircraft? (2) If so, will the converged policies discriminate against companies operating sUASs with weaker configurations? We investigate a multi-agent reinforcement learning paradigm in which homogeneous aircraft within heterogeneous fleets operate concurrently to perform package delivery missions over Dallas, Texas, USA. An attention-enhanced Proximal Policy Optimization-based Advantage Actor-Critic (PPOA2C) framework is employed to resolve intra- and inter-fleet conflicts, with each fleet independently training its own policy while preserving privacy. Experimental results show that two fleets with distinct, shared PPOA2C policies can reach an equilibrium to maintain safe separation. While two PPOA2C policies outperform two strong rule-based baselines in terms of conflict resolution, a PPOA2C policy exhibits safer interaction with a rule-based policy, indicating adaptive capabilities of PPOA2C policies. Furthermore, we conducted extensive policy-configuration evaluations, which reveal that equilibria between similar policy types tend to favor fleets with stronger configurations. Even under similar configurations but different policy types, the equilibrium favors one of the heterogeneous policies, underscoring the need for fairness-aware conflict management in heterogeneous sUAS operations.
comment: 8 pages, 3 figure, 1 table
♻ ☆ FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
Adapting pretrained models typically involves a trade-off between the high training costs of backpropagation and the heavy inference overhead of memory-based or in-context learning. We propose FAAST, a forward-only associative adaptation method that analytically compiles labeled examples into fast weights in a single pass. By eliminating memory or context dependence, FAAST achieves constant-time inference and decouples task adaptation from pretrained representation. Across image classification and language modeling benchmarks, FAAST matches or exceeds backprop-based adaptation while reducing adaptation time by over 90% and is competitive to memory/context-based adaptation while saving memory usage by up to 95%. These results demonstrate FAAST as a highly efficient, scalable solution for supervised task adaptation, particularly for resource-constrained models. We release the code and models at https://github.com/baoguangsheng/faast.
comment: 9 pages, 6 figures, 10 tables
♻ ☆ From Time Series Analysis to Question Answering: A Survey in the LLM Era IJCAI 2026
Recently, Large Language Models (LLMs) have introduced a novel paradigm in Time Series Analysis (TSA), leveraging strong language capabilities to support tasks such as forecasting and anomaly detection. However, these analysis tasks cannot adequately cover temporal language tasks, such as interpretation and captioning. A fundamental gap remains between TSA and LLMs: LLMs are pre-trained to optimize natural language relevance for question answering rather than objectives specialized for TSA. To bridge this gap, TSA is evolving toward Time Series Question Answering (TSQA), shifting from expert-driven and task-specific analysis to user-driven and task-unified question answering. TSQA depends on flexible exploration rather than predefined TSA pipelines. In this survey, we first propose a taxonomy that reflects the evolution from TSA to TSQA, driven by a shift from external to internal alignment. We then organize existing literature into three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment, and provide practical guidance for flexible, economical, and generalizable selection of alignment paradigms. We finally analyze datasets across domains and characteristics, identify challenges, and highlight future research directions.
comment: Accepted by IJCAI 2026 Survey Track
♻ ☆ Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
♻ ☆ Geometric Analysis of Neural Regression Collapse via Intrinsic Dimension
Neural multivariate regression underpins a wide range of domains, including control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze regression models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H < ID_Y, leading to over-compression and poor generalization, whereas non-collapsed models typically maintain ID_H > ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression collapse and suggest practical strategies for enhancing generalization.
comment: 36 pages, 21 figures
♻ ☆ Exact Is Easier: Credit Assignment for Cooperative LLM Agents
Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi-agent LLM systems this evaluation distorts the result it claims to measure. This failure is not isolated: learned critics, trajectory-level baselines, and agent-removal counterfactuals all inherit from standard multi-agent reinforcement learning a premise that exact counterfactual evaluation requires privileged environment access, and therefore approximate. In cooperative LLM systems, this premise is false. Interaction histories are deterministic functions of observable text with no hidden state, so any decision point can be restored exactly, making direct causal measurement possible without parametric approximation. C3 exploits this property by fixing the complete history at each decision point, sampling alternative actions under a frozen behavior policy, and computing unbiased per-decision advantages through a parameter-free leave-one-out baseline. Across six benchmarks spanning math reasoning and code generation, two model families, and two multi-agent topologies, C3 consistently outperforms all baselines; a controlled decomposition confirms gains originate from credit quality, not architecture, while checkpoint restoration reduces training token consumption. The exact solution proves simpler, cheaper, and more effective than all approximate alternatives. The same structural property that enables exact credit also enables exact verification: three independently computable diagnostics, credit fidelity, within-group variance, and inter-agent influence, constitute the first method-agnostic auditing tool for multi-agent LLM credit assignment. Our code is available at https://github.com/EIT-EAST-Lab/C3
♻ ☆ Geometric and dynamical analysis of attractor boundaries and storage limits in kernel Hopfield networks
High-capacity associative memories based on Kernel Logistic Regression (KLR) exhibit strong storage capabilities, but the dynamical and geometric mechanisms underlying their stability remain poorly understood. This paper investigates the global geometry of attractor basins and the mechanisms governing the storage limit in KLR-trained Hopfield networks. We combine empirical evaluations using random sequences and real-world image embeddings (CIFAR-10) with morphing experiments and statistical Signal-to-Noise Ratio (SNR) analysis. Our experiments show that the network achieves a storage capacity for random sequences up to $P/N \approx 16$, while maintaining stable retrieval for structured data at effective loads near $P/N \approx 20$. Morphing analysis indicates that attractors on the "Ridge of Optimization" are separated by sharp, phase-transition-like boundaries, characterized by steep effective potential barriers and critical slowing down. Furthermore, by comparing an SNR analysis with a geometric reference point inspired by Cover's theorem, we show that the practical storage limit is governed primarily not by a lack of geometric separability in the feature space, but by the loss of dynamical stability against crosstalk noise. These findings suggest that KLR networks function as highly localized exemplar-based memories that operate near the onset of dynamical collapse, providing a useful perspective on the design of robust, large-scale retrieval systems.
comment: 10 pages, 6 figures
♻ ☆ Econometric vs. Causal Structure-Learning for Time-Series Policy Decisions: Evidence from the UK COVID-19 Policies
Causal machine learning (ML) recovers graphical structures that inform us about potential cause-and-effect relationships. Most progress has focused on cross-sectional data with no explicit time order, whereas recovering causal structures from time series data remains the subject of ongoing research in causal ML. In addition to traditional causal ML, this study assesses econometric methods that some argue can recover causal structures from time series data. The use of these methods can be explained by the significant attention the field of econometrics has given to causality, and specifically to time series, over the years. This presents the possibility of comparing the causal discovery performance between econometric and traditional causal ML algorithms. We seek to understand if there are lessons to be incorporated into causal ML from econometrics, and provide code to translate the results of these econometric methods to the most widely used Bayesian Network R library, bnlearn. We investigate the benefits and challenges that these algorithms present in supporting policy decision-making, using the real-world case of COVID-19 in the UK as an example. Four econometric methods are evaluated in terms of graphical structure, model dimensionality, and their ability to recover causal effects, and these results are compared with those of eleven causal ML algorithms. Amongst our main results, we see that econometric methods provide clear rules for temporal structures, whereas causal-ML algorithms offer broader discovery by exploring a larger space of graph structures that tends to lead to denser graphs that capture more identifiable causal relationships.
♻ ☆ Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
comment: 28 pages, 13 figures
♻ ☆ KV Cache Offloading for Context-Intensive Tasks
With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
comment: Preprint
♻ ☆ SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning ICML 2026
In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes over training. Recently, Mixture-of-Experts (MoE) networks have been reported to enable scaling laws and facilitate the learning of diverse skills. However, in continual reinforcement learning settings, their performance can degenerate as learning proceeds, indicating a loss of plasticity. To address this, building on Neural Tangent Kernel (NTK) theory, we formalize the plasticity loss in MoE policies as a loss of spectral plasticity. We then derive a tractable proxy for spectral plasticity, one expressible in terms of individual expert feature matrices. Leveraging this proxy, we introduce SPHERE, a practical Parseval penalty tailored for MoE-based policies that alleviates the loss of spectral plasticity. On MetaWorld and HumanoidBench, SPHERE improves average success under continual RL by 133% and 50% over an unregularized MoE baseline, while maintaining higher spectral plasticity throughout training.
comment: Accepted to ICML 2026
♻ ☆ DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty. DGPO replaces the volatile KL divergence with the bounded Hellinger distance to safely quantify token level exploration without the risk of gradient explosion. To effectively distinguish genuine reasoning breakthroughs from hallucinatory noise, we propose an entropy gating mechanism that scales this deviation by the policy`s epistemic uncertainty. By dynamically redistributing the coarse sequence-level advantage to individual tokens based on these gated scores, DGPO heavily incentivizes critical exploratory steps while suppressing unwarranted, low-entropy deviations. Consequently, DGPO completely eliminates the traditional token-level KL penalty and achieves fine-grained credit reallocation without the computational overhead of an additional value network. Extensive empirical evaluations demonstrate that DGPO sets a new state-of-the-art for critic free alignment. Notably, on the Qwen2.5-32B architecture, DGPO achieves 60.0% Avg@32 accuracy and 46.0% Avg@32 accuracy on the challenging AIME2024 and AIME2025 benchmarks respectively, substantially outperforming competitive baselines like DAPO.
♻ ☆ Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^π$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
♻ ☆ Optimising MFCC parameters for the automatic detection of respiratory diseases
Voice signals originating from the respiratory tract are utilized as valuable acoustic biomarkers for the diagnosis and assessment of respiratory diseases. Among the employed acoustic features, Mel Frequency Cepstral Coefficients (MFCC) is widely used for automatic analysis, with MFCC extraction commonly relying on default parameters. However, no comprehensive study has systematically investigated the impact of MFCC extraction parameters on respiratory disease diagnosis. In this study, we address this gap by examining the effects of key parameters, namely the number of coefficients, frame length, and hop length between frames, on respiratory condition examination. Our investigation uses four datasets: the Cambridge COVID-19 Sound database, the Coswara dataset, the Saarbrucken Voice Disorders (SVD) database, and a TACTICAS dataset. The Support Vector Machine (SVM) is employed as the classifier, given its widespread adoption and efficacy. Our findings indicate that the accuracy of MFCC decreases as hop length increases, and the optimal number of coefficients is observed to be approximately 30. The performance of MFCC varies with frame length across the datasets: for the COVID-19 datasets (Cambridge COVID-19 Sound database and Coswara dataset), performance declines with longer frame lengths, while for the SVD dataset, performance improves with increasing frame length (from 50 ms to 500 ms). Furthermore, we investigate the optimized combination of these parameters and observe substantial enhancements in accuracy. Compared to the worst combination, the SVM model achieves an accuracy of 81.1%, 80.6%, and 71.7%, with improvements of 19.6%, 16.10%, and 14.90% for the Cambridge COVID-19 Sound database, the Coswara dataset, and the SVD dataset respectively.
♻ ☆ Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
Prior-data fitted networks (PFNs) have emerged as promising foundation models for prediction from tabular datasets, achieving state-of-the-art performance on small to moderate data sizes without tuning. While PFNs are motivated by Bayesian ideas, they do not provide any uncertainty quantification for predictive means, quantiles, or similar quantities. We propose a principled, efficient, and tuning-free sampling procedure to construct Bayesian posteriors for such estimates based on martingale posteriors, and prove its convergence. Several simulated and real-world data examples showcase the efficiency and calibration of our method in inference applications.
♻ ☆ Closed-Form Last Layer Optimization
Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. We provide theoretical analyses showing convergence of the method to an optimal solution in the neural tangent kernel regime, as well as quantifying the gains compared to standard SGD in a one-step analysis. Finally, we demonstrate the effectiveness of our approach compared with SGD and Adam on a squared loss in several regression tasks, including neural operators and causal inference.
♻ ☆ Discovering Learning-Friendly Generation Orders for Sequential Computation
Sequential computation via autoregressive generation can make difficult tasks learnable, but the generation order of intermediate states strongly affects whether training succeeds. We address the problem of discovering a learning-friendly target order automatically, rather than relying on task-specific design. Our key observation is that learning-friendly orders cause a faster loss drop in the early stage of training. We exploit this by \emph{loss profiling}, which ranks candidate orders by the early-stage loss of a single short run. To handle the factorial candidate space, we wrap loss profiling in a hierarchical global -- local search over block- and within-block-level orderings. On six order-sensitive tasks, the method discovers effective orders up to $L=13$ from random initialization and up to $L=40$ from structured initialization, lifting success rates from about 10\% to near 100\%. On integer multiplication, it rediscovers the reverse-digit order that was reported to be efficient in prior studies. On delay dynamical systems, as a case study of multi-variate recurrences, learnability varies sharply even among valid topological sorts of the dependency graph: loss profiling identifies a learning-friendly one, and the global search even discovers orders surpassing hand-designed candidates.
comment: 10+24 pages, 10 figures
♻ ☆ DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning
Machine unlearning aims to remove specified training data to satisfy privacy regulations such as GDPR. However, existing evaluations assume identical precision at unlearning and deployment, overlooking that production LLMs are deployed at low-bit precision. We show that INT4 quantization systematically restores forgotten content even when models pass compliance audits at bfloat16 (BF16), we term this the quantization recovery attack (QRA). We conduct the first systematic study of unlearning robustness under adapter-space INT4 quantization in the NF4+LoRA regime, evaluating seven methods on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU. INT8 is benign; INT4 induces recovery of up to 22x, worsening with dataset difficulty. We identify the FA-RA-Q-INT4 trilemma: no method simultaneously achieves strong forgetting, high utility, and quantization robustness. A dense Pareto sweep reveals a sharp phase transition once robustness is achieved, retaining accuracy collapses regardless of further tuning. To address this, we propose DURABLEUN-SAF (Sharpness-Aware Forgetting), a quantization-aware objective using Straight-Through Estimator gradients through INT4 rounding. DURABLEUN-SAF is the only method to achieve a stable empirical (0.047, {BF16, INT8, INT4})- durability certificate: Q-INT4= 0.043 +- 0.002, cert rate= 3/3, versus SalUn's cert rate= 1/3 at its own published hyperparameters. We call for Q-INT4 to be adopted as a standard evaluation metric alongside FA and RA.
♻ ☆ Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode \emph{correction suppression} and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19\% to 90\%, with four models exceeding 80\%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as \emph{knowing but not correcting} -- suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0\%$\to$58.2\%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce \emph{factual strictness} -- the willingness to uphold accuracy against contextual pressures -- as a new dimension of model reliability.
♻ ☆ How Log-Barrier Helps Exploration in Policy Optimization
Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
♻ ☆ Clinical Characteristics and Laboratory Biomarkers in ICU-admitted Septic Patients with and without Bacteremia
Few studies have investigated the diagnostic utilities of biomarkers for predicting bacteremia among septic patients admitted to intensive care units (ICU). Therefore, this study evaluated the prediction power of laboratory biomarkers to utilize those markers with high performance to optimize the predictive model for bacteremia. This retrospective cross-sectional study was conducted at the ICU department of Gyeongsang National University Changwon Hospital in 2019. Adult patients qualifying SEPSIS-3 (increase in sequential organ failure score greater than or equal to 2) criteria with at least two sets of blood culture were selected. Collected data was initially analyzed independently to identify the significant predictors, which was then used to build the multivariable logistic regression (MLR) model. A total of 218 patients with 48 cases of true bacteremia were analyzed in this research. Both CRP and PCT showed a substantial area under the curve (AUC) value for discriminating bacteremia among septic patients (0.757 and 0.845, respectively). To further enhance the predictive accuracy, we combined PCT, bilirubin, neutrophil lymphocyte ratio (NLR), platelets, lactic acid, erythrocyte sedimentation rate (ESR), and Glasgow Coma Scale (GCS) score to build the predictive model with an AUC of 0.907 (95% CI, 0.843 to 0.956). In addition, a high association between bacteremia and mortality rate was discovered through the survival analysis (0.004). While PCT is certainly a useful index for distinguishing patients with and without bacteremia by itself, our MLR model indicates that the accuracy of bacteremia prediction substantially improves by the combined use of PCT, bilirubin, NLR, platelets, lactic acid, ESR, and GCS score.
comment: This research is not complete
♻ ☆ A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback
We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(α,β,δ,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly worst-case) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(δ^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.
♻ ☆ Chordless cycle filtrations for dimensionality detection in complex networks via topological data analysis
Many complex networks, ranging from social to biological systems, exhibit structural patterns consistent with an underlying hyperbolic geometry. Revealing the dimensionality of this latent space can disentangle the structural complexity of communities, impact efficient network navigation, and fundamentally shape connectivity and system behavior. We introduce a topological data analysis weighting scheme for graphs based on chordless cycles to estimate network dimensionality in a data-driven way. We further show that the resulting descriptors can effectively estimate network dimensionality using a neural network architecture trained on a synthetic graph database constructed for this purpose, which requires no retraining to transfer effectively to real-world networks. Thus, by combining cycle-aware filtrations, algebraic topology, and machine learning, our approach provides a robust and effective method for uncovering the hidden geometry of complex networks and guiding accurate modeling and low-dimensional embedding.
♻ ☆ Versatile yet Efficient Network Traffic Analysis: Offloading Network Foundation Model to SmartNIC
Pervasive encryption makes large-scale labeling infeasible for traffic analysis, while security operations demand edge analysis to avert service degradation and further vulnerabilities. These pressures have produced two disjoint research lines: 1) versatile analysis, via network foundation models for low label dependency, and 2) efficient analysis, via hardware offloading for low analysis latency. However, versatility and efficiency have appeared fundamentally incompatible to co-achieve, with prior work consistently sacrificing one for the other, yet we show that this incompatibility is a consequence of polarized design choices across the three components of traffic analysis systems, i.e., traffic processing, model architecture, and analysis execution. In response, we present Nepco, a versatile yet efficient network traffic analysis system that offloads network foundation models to SmartNIC. Our key observation is that discriminative traffic information is concentrated in localized byte regions, motivating versatile yet efficient localized byte-sequence modeling rather than inefficient global modeling. To exploit this without incurring the latency bottlenecks of complex encoding steps, we employ a hardware-friendly processing pipeline that directly embeds raw byte sequences. Crucially, to maintain versatility across diverse tasks, we propose a pattern-aware convolutional architecture equipped with dedicated scoring and gating mechanisms. By exploiting translation invariance, this design dynamically locates and extracts salient semantic signatures. We prototype Nepco on the Nvidia BlueField-3 SmartNIC with multiengine collaborative analysis execution. The experimental results demonstrate that Nepco achieves macro F1 competitive with the best performances achieved by 8 state-of-the-art network foundation models, while reducing end-to-end latency by 328x to the millisecond scale.
comment: Under review
♻ ☆ ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning ICML 2026
Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4\% in Avg@16 and 7.0\% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
comment: Accepted to ICML 2026. Preprint version. https://github.com/1229095296/ResRL.git
♻ ☆ Faster Verified Explanations for Neural Networks
Verified explanations are a principled way to explain the decisions taken by neural networks, which are otherwise black-box in nature. However, these techniques face significant scalability challenges, as they require multiple calls to neural network verifiers, each of them with an exponential worst-case complexity. We present FaVeX, a novel algorithm to compute verified explanations. FaVeX accelerates the computation by dynamically combining batch and sequential processing of input features, and by reusing information from previous queries, both when proving invariances with respect to certain input features, and when searching for feature assignments altering the prediction. Furthermore, we present a novel and hierarchical definition of verified explanations, termed verifieroptimal robust explanations, that explicitly factors the incompleteness of network verifiers within the explanation. Our comprehensive experimental evaluation demonstrates the superior scalability of both FaVeX, and of verifier-optimal robust explanations, which together can produce meaningful formal explanation on networks with hundreds of thousands of non-linear activations.
comment: ECOOP 2026
♻ ☆ An abstract effective convergence theorem for stochastic processes, with applications to stochastic approximation
We provide a general theorem on the asymptotic behavior of stochastic processes that conform to a relaxed supermartingale condition. The distinguishing feature of our result is that it provides quantitative convergence guarantees at a much higher level of abstraction and generality than is typically seen in the stochastic approximation literature, formulated in particular in terms of a general modulus $τ$ that, on an intuitive level, captures an effective variant of the uniqueness in expectation of associated solutions. Our convergence rate is highly uniform, depending on very few data beyond $τ$. We then demonstrate the utility of our result as a unifying framework by deriving new quantitative versions of several key concepts and theorems from stochastic approximation, including the Robbins-Siegmund theorem, Dvoretzky's convergence theorem, and the convergence of stochastic quasi-Fejér monotone sequences, the latter formulated in a novel and highly general metric context. Throughout, we isolate and discuss special cases of our results which allow for the construction of fast, and in particular linear, rates. Various applications of our results and our general methodology to stochastic approximation are discussed, and in particular explicitly derived in related work of the authors.
comment: 25 pages
♻ ☆ PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection ICML 2026
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
comment: Accepted at ICML 2026
♻ ☆ DT-PBO: an Interpretable Tree-based Surrogate Model for Preferential Bayesian Optimization
Preferential Bayesian Optimization (PBO) aims to find a decision-maker's most preferred solution in as few pairwise comparisons as possible. Existing approaches rely on Gaussian Process (GP) surrogates, which provide strong performance but limited interpretability. This limits real-world usability in high-stakes domains, such as healthcare, where interpretability and trust are essential. We propose DT-PBO, a novel tree-based surrogate model for PBO that is inherently interpretable while capturing preference uncertainty. Specifically, we introduce a novel splitting heuristic that constructs interpretable shallow decision trees directly from pairwise comparison data, and use Laplace approximation to obtain probabilistic estimates for each leaf. This enables efficient preference modeling without sacrificing interpretability. Across eight benchmark functions, our method achieves competitive convergence to GP-based PBO, particularly on functions with rugged optimization landscapes. Additional experiments show robustness against noise and a fast computational running time. Experiments on real-world datasets further demonstrate that our model provides interpretable insights into decision-maker preferences that would remain opaque under GP-based approaches.
♻ ☆ Robust Sublinear Convergence Rates for Iterative Bregman Projections
Entropic regularization provides a simple way to approximate linear programs whose constraints split into two or more tractable blocks. The resulting objectives are amenable to cyclic Kullback-Leibler (KL) Bregman projections, with Sinkhorn-type algorithms for optimal transport, matrix scaling, and barycenters as canonical examples. This paper gives a general blueprint for proving $O(1/k)$ dual convergence rate with a constant that scales only linearly in $1/γ$, where $γ$ is the entropic regularization parameter. We call such rates "robust", because this mild dependence on $γ$ underpins favorable complexity bounds for approximating the unregularized problem via alternating KL projections. The blueprint reduces the proof to a uniform primal bound and a dual bound for a quotient norm induced by the constraint split. To make these inputs usable, we propose two helper results, which rely on the non-expansiveness of the dual iterations in this quotient dual norm. Instantiating this blueprint for graph-structured transport yields a new flow-Sinkhorn algorithm for the Wasserstein-1 distance on graphs. It achieves $\varepsilon$-additive accuracy on the transshipment cost in $O(p\,\mathrm{diameter}^3/\varepsilon^{4})$ arithmetic operations (up to logarithmic factors), where $p$ is the number of edges. We also provide a machine-checked Lean formalization of the core blueprint and its graph-$\mathrm{W}_1$ instantiation.
♻ ☆ Path Integration and Object-Location Binding Emerge in an Action-Conditioned Predictive Sequence Network
Adaptive cognition requires structured internal models of objects and their relations. Predictive neural networks are often proposed to learn such world models, but how these are instantiated and how they support prediction remain unclear. We investigate this in a minimal in-silico setting. A recurrent neural network samples tokens sequentially from 2D continuous token scenes and is trained to predict the upcoming token from the current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned as well. Together, these findings show how structured representations relying on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.
comment: 8 pages, 4 figures; accepted at CogSci 2026
♻ ☆ SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://github.com/St0pien/SoftSAE.
♻ ☆ VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an explicit variance bound for the deployed kernel. Experiments on math reasoning and code generation show that VESPO maintains stable training under severe off-policy conditions (staleness up to 64x) and delivers consistent gains across both dense and Mixture-of-Experts (MoE) models, outperforming recent reshaping baselines under matched setup. Code is available at https://github.com/FloyedShen/VESPO.
♻ ☆ Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an internal dataset with more than 10,000 subjects, which is an order of magnitude larger than existing multi-view human head datasets. HeadsUp achieves state-of-the-art reconstruction quality and generalizes to novel identities without test-time optimization. We extensively analyze the scaling behavior of our model across identities, views, and model capacity, revealing practical insights for quality-compute trade-offs. Finally, we highlight the strength of our latent space by showcasing two downstream applications: generating novel 3D identities and animating the 3D heads with expression blendshapes.
comment: Project page: https://apple.github.io/ml-headsup/
♻ ☆ A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
In-context learning (ICL) enables task adaptation at inference time by conditioning on demonstrations rather than updating model parameters. Although recent time-series foundation models incorporate contextual conditioning, retrieval, or example-based prompting, they typically rely on implicit positional structure or task-specific objectives rather than explicit instruction-conditioned input-output demonstrations. We introduce iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning to infer tasks directly from example demonstrations. iAmTime represents each episode as a structured prompt over historical context and future-known variables using specialized semantic tokens that attend to designated time-series regions, exchange information across demonstrations, and inject task information into the query representation. The model combines a Hierarchical Multi-Scope Transformer Encoder, which captures temporal and covariate dynamics while inferring latent task structure from demonstrated input-output mappings, with a Task-Conditioned Patch Decoder, which adapts decoding through expert-based routing. We train iAmTime on large-scale real and synthetic corpora using supervised and self-supervised instruction-conditioned tasks, including forecasting, imputation, reconstruction, classification, anomaly detection, and source de-mixing. Across diverse domains, frequencies, and horizons, iAmTime improves zero-shot adaptation over strong time-series foundation baselines on probabilistic and point forecasting benchmarks, while achieving competitive performance on non-forecasting tasks such as classification.
♻ ☆ Diffusion Path Samplers via Sequential Monte Carlo
We develop diffusion-based samplers for target distributions known up to a normalising constant. To this end, we rely on the well-known diffusion path that smoothly interpolates between a simple base distribution and the target, popularised by diffusion models. We tackle the score estimation problem by developing an efficient sequential Monte Carlo sampler that evolves auxiliary variables from conditional distributions along the path, providing principled score and density estimates for time-varying distributions. To control the variance of score estimates, we further propose practical control variate schedules that incur minimal overhead. We adapt this general framework to paths induced by the Ornstein-Uhlenbeck (OU) time-reversal process, stochastic interpolants, and diffusion annealed Langevin dynamics, outlining their trade-offs. Finally, we provide theoretical guarantees and empirically demonstrate the effectiveness of our method on several synthetic and real-world datasets.
♻ ☆ Sparser, Faster, Lighter Transformer Language Models
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
comment: Code and checkpoints available at: https://github.com/SakanaAI/sparser-faster-llms
♻ ☆ RNE: plug-and-play diffusion inference-time control and energy-based training ICLR 2026
Diffusion models generate data by removing noise gradually, which corresponds to the time-reversal of a noising process. However, access to only the denoising kernels is often insufficient. In many applications, we need the knowledge of the marginal densities along the generation trajectory, which enables tasks such as inference-time control. To address this gap, in this paper, we introduce the Radon-Nikodym Estimator (RNE). Based on the concept of the \textit{density ratio} between path distributions, it reveals a fundamental connection between marginal densities and transition kernels, providing a flexible plug-and-play framework that unifies (1) diffusion density estimation, (2) inference-time control, and (3) energy-based diffusion training under a single perspective. Experiments demonstrate that RNE delivers strong results in inference-time control applications, such as annealing and model composition, with promising inference-time scaling performance, and achieves a simple yet efficient regularisation for training energy-based diffusion models. Additionally, our proposed RNE is modality-agnostic and applicable not only to continuous diffusion models but also to their discrete diffusion counterparts.
comment: Accepted at ICLR 2026
♻ ☆ Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting
Forecasting physical systems over long horizons from irregularly sampled observations demands models that are stable, computationally efficient, and free of fixed-timestep assumptions. We address this with a continuous-time Koopman autoencoder whose latent dynamics obey $dz/dt = \mathbf{K}_{\mathrm{cont}} z$, yielding closed-form inference via $z(τ) = \exp(\mathbf{K}_{\mathrm{cont}} τ) z(0)$ at any horizon $τ$ in a single step. This decouples forecast cost from forecast length at inference time and supports data assimilation as gradient-based optimization with cost independent of the assimilation window. However, scaling continuous-time Koopman dynamics to high-dimensional chaotic systems causes severe latent instability, including spectral collapse and trajectory divergence over long horizons. In contrast, discrete Koopman methods train an operator $\mathbf{A}$ such that $z_{t+Δt} = \mathbf{A} z_t$; recovering the continuous generator could be theoretically done through matrix logarithm but requires conditions not guaranteed by training, and approximation errors grow with the $Δt$ imposed by the training data. These methods also require fixed, regular timesteps. We identify an empirically effective set of structural constraints -- rollout training, forward-backward consistency, latent regularization, and physics-conditioned LoRA -- sufficient for stable long-horizon latent dynamics. On challenging fluid benchmarks, our method outperforms strong diffusion and operator-learning baselines on long-horizon forecasting while achieving a 110$\times$ inference speedup.
♻ ☆ User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums
Customer feedback in industrial forums offers rich but underexplored insights into real-world product experience. Yet systematic analysis remains challenging due to unstructured, domain-specific content and the scarcity of high-quality labeled datasets. This paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JSON record contains multi-post comments enriched with metadata and annotated by a large language model (LLM) for UX insights, user expectations, severity ratings, sentiment, and topic classifications. UXPID is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. It supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in technical forums, providing a valuable resource for advancing NLP methods within industrial product support and software engineering domains.
♻ ☆ SpikingBrain: Spiking Brain-inspired Large Models
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
♻ ☆ Emergence of Distortions in High-Dimensional Guided Diffusion Models
Classifier-free guidance (CFG) is the de facto standard for conditional sampling in diffusion models, yet it often reduces sample diversity. Using tools from statistical physics, we analyze the emergence of generative distortions induced by CFG, namely the mismatch between the CFG sampling distribution and the true conditional distribution. We study this phenomenon in analytically tractable settings with exact score functions, characterizing its dependence on data dimensionality and the number of classes. For high-dimensional Gaussian mixtures, we use dynamic mean-field theory to show that distortions arise when the number of classes scales exponentially with the data dimension, whereas they vanish in the sub-exponential regime due to a dynamical phase transition. We further prove that, in the infinite-class limit, distortions remain unavoidable regardless of dimensionality because of the increasing density of classes. Finally, we show that standard CFG schedules cannot prevent variance shrinkage, and we propose a theoretically grounded guidance schedule incorporating a negative-guidance window that improves both class separability and sample diversity in real-world latent diffusion models.
comment: 41 pages, 21 figures
♻ ☆ Structured Prototype-Guided Adaptation for EEG Foundation Models
Electroencephalography (EEG) foundation models (EFMs) have shown strong potential for transferable representation learning, yet their adaptation in realistic settings remains challenging when only a few labeled subjects are available. We show that this challenge stems from a structural mismatch between noisy, limited supervision and the highly plastic parameter space of EFMs, reflected in three key failure modes: overconfident miscalibration, prediction collapse, and representation drift caused by unconstrained parameter updates. To address these challenges, we propose SCOPE, a Structured COnfidence-aware Prototype-guided framework for label-limited EFM adaptation. SCOPE first constructs cohort-level external supervision to provide persistent guidance and further derives confidence-aware pseudo-labels to select reliable unlabeled samples for adaptation. Building on the constructed external supervision, SCOPE introduces ProAdapter, a lightweight prototype-conditioned adapter that modulates frozen EFMs to preserve pretrained representations. Experiments across 50 label-limited adaptation settings, covering 6 EEG tasks, 5 EFM backbones, and 5%-50% training labeled-subject ratios, show that SCOPE consistently achieves strong performance and efficiency.
♻ ☆ Identifiability Challenges in Sparse Linear Ordinary Differential Equations
Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that "linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory." However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allows for quantitative assessments of how much to trust a learned linear ODE.
comment: 9 pages, 4 figures
♻ ☆ RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization ICLR 2026
Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, arbitrary canonical representation. We introduce RECON, a class-pose agnostic canonical orientation normalization that corrects arbitrary canonicals via a simple right translation, yielding natural, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play test-time canonicalization layer. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We validate on images and molecular ensembles, demonstrating accurate symmetry discovery, and matching or outperforming other canonicalizations in downstream classification.
comment: Accepted as a conference paper at ICLR 2026
♻ ☆ Transferable SCF-Acceleration through Solver-Aligned Initialization Learning
The cost of Kohn-Sham density functional theory (KS-DFT) calculations scales with the number of solver iterations, which depends on the quality of the initial guess. Machine learning methods that predict initial guesses from molecular geometry can reduce this cost, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al., 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the self-consistent field (SCF) solver end-to-end. We introduce the Effective Relative Iteration Count (ERIC), a correction to the commonly used RIC that accounts for hidden Fock-build overhead. On QM40, which contains molecules up to 4$\times$ larger than the training distribution, SAIL reduces ERIC by 37\% (PBE), 33\% (SCAN), and 28\% (B3LYP), more than doubling the previous state-of-the-art reduction on B3LYP. On QMugs molecules 10$\times$ larger than the training set, SAIL delivers a 1.35$\times$ wall-time speedup at the hybrid level of theory, extending ML SCF acceleration to large drug-like molecules.
♻ ☆ CGF-Softmax: A Cumulant-Based Softmax Reformulation for Efficient Inference under Homomorphic Encryption
Homomorphic encryption (HE) is a prominent framework for privacy-preserving machine learning, enabling inference directly on encrypted data. However, evaluating softmax, a core component of transformer architectures, remains particularly challenging in HE due to its multivariate structure, the large dynamic range induced by exponential functions, and the costly division operation. In this paper, we propose CGF-softmax, which reformulates the softmax denominator through the cumulant generating function (CGF). By eliminating both homomorphic division and explicit maximum subtraction, this reformulation substantially reduces multiplicative depth while preserving key properties of softmax. Extensive experiments on Vision Transformers and large language models show that CGF-softmax provides an efficient and accurate approximation of softmax in encrypted inference. In particular, it achieves inference accuracy close to that of high-depth exact methods, while requiring substantially lower computational cost through reduced multiplicative depth.
♻ ☆ Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} that restricts off-manifold exploration. Existing workarounds, such as appending \textit{residual} Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose \textit{Maximum Entropy Adjoint Matching} (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a \textit{Mixture Behavior Prior} that broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
♻ ☆ MinMax Recurrent Neural Cascades
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
♻ ☆ Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction
Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix monolithically -- through proxies such as concrete scores (SEDD) or clean-data predictions (MDLM, GIDD) -- rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. We propose \textbf{Neural CTMC}, which exploits the underlying Poisson structure of CTMC dynamics by separately parameterizing the reverse process through an \emph{exit rate} (when to jump) and a \emph{jump distribution} (where to jump) via two dedicated network heads. We show that the evidence lower bound (ELBO) reduces to a path-space KL divergence between the true and learned reverse processes that factorizes into a Poisson KL for timing and a categorical KL for direction, and admits a tractable, gradient-equivalent and consistent loss. Experimentally, scored by Gemma2-9B, our pure-uniform Neural CTMC achieves $16.36$ generative perplexity on TinyStories (vs.\ GIDD $37.60$ and MDLM $42.66$). On OpenWebText, it attains the best perplexity at the same training-token budget across 16--128 sampling steps among the methods we compare (e.g., at 128 steps: Neural CTMC $183.6$ vs.\ MDLM $210.5$ and GIDD $249.8$). To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
♻ ☆ Amortized Multi-Objective Optimization Across Tasks with Generative Solution Modeling IJCAI 2026
Many real-world applications require solving families of expensive multi-objective optimization problems~(EMOPs) under varying operational conditions. This can be formulated as parametric expensive multi-objective optimization problems (P-EMOPs) where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for each task. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinite distinct problems, each requiring separate expensive evaluations. To address this, we propose learning an inverse model to amortize the multi-objective optimization cost across the continuous task-preference space, enabling direct solution prediction for any query without the need for expensive re-evaluation. This paper introduces a novel parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) generative solution sampling via conditional generative models and (2) acquisition-driven search leveraging inter-task synergies. This approach enables effective optimization across multiple tasks and finally achieves direct solution prediction for unseen parameterized EMOPs without re-evaluations. We theoretically justify the faster convergence by leveraging inter-task synergies through task-aware Gaussian processes. Based on that, empirical studies in synthetic and real-world benchmarks further verify the effectiveness of the proposed parametric optimizer.
comment: Accepted by IJCAI 2026
♻ ☆ CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition ICLR 2026
Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
comment: Accepted as poster at ICLR 2026
♻ ☆ Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation
We propose MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation that trains the entire agentic system on policy within its deployed radiology workflow. MARL-Rad addresses the limitation of post-hoc agentization, where fixed LLMs are organized into hand-designed agentic workflows without being optimized for their assigned roles. Our framework decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, and jointly optimizes them using clinically verifiable rewards. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art clinical efficacy performance. Further analyses show that MARL-Rad improves laterality consistency and produces more accurate and detailed reports. A blinded clinician evaluation further suggests that MARL-Rad produces reports clinically comparable to ground-truth reports.
comment: 23 pages, 4 figures
♻ ☆ Rethinking Weight Tying: Pseudo-Inverse Tying for LM Stable Training and Updates
Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, parameter sharing alone does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and weakening explainability probes that rely on a meaningful vocabulary-space decoder. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by polar initialization from a source checkpoint for continued pretraining or by random orthonormal initialization for from-scratch pretraining, and introduces a learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and vocabulary-sized auxiliary parameters. Beyond improving training stability, PIT provides a cleaner substrate for logit-lens-style and vocabulary-space explainability probes by keeping the input and output token geometries synchronized. We evaluate PIT on on-device models spanning 256M-1.3B parameters. The results show that PIT improves continued-pretraining stability, enforces near-exact token-interface consistency across settings, and yields more predictable lightweight adaptation after continued pretraining, while from-scratch pretraining reveals a trade-off between strict interface consistency and unconstrained optimization.
comment: an early-stage version
♻ ☆ Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
comment: https://github.com/Kwen-Chen/Flexible-Entropy-Control
♻ ☆ How Hard Is It for Message-Passing GNNs to Simulate One Weisfeiler-Lehman Color-Refinement Step?
Message-passing graph neural networks (MPGNNs) are commonly compared with the Weisfeiler-Lehman (WL) color-refinement procedure, but this comparison does not quantify the resource parameters a network needs to realize color refinement with bounded-size messages and finite numerical precision. We study the cost of simulating a single color-refinement step on unattributed graphs. We distinguish input-independent, or oblivious, simulation from instance-dependent simulation. In the former, the parameters, or their distributions in randomized models, are fixed before the input instance is known. Our results show that the local form of WL color refinement hides a global relabeling problem. In the oblivious setting, deterministic and zero-error randomized MPGNNs cannot solve this problem in the worst case using only shallow networks with small messages. We complement this lower bound with a nearly matching construction in a stronger rooted, port-aware model. By contrast, when the color set is large, bounded-error randomness can greatly reduce the cost, and a one-layer MPGNN with messages of logarithmic size and a logarithmic number of random bits suffices. We show that this logarithmic number of random bits is essentially necessary for shallow, small-message simulations. When the color set is small, we still obtain a rooted, port-aware simulation, but this construction requires more layers or larger messages. We also prove that this extra cost is partly unavoidable, as small color sets force a nontrivial trade-off between the number of layers and the message size. Finally, instance-dependent simulation can be much shallower, but the required instance-specific parameters are not necessarily easy to find. Together, these results reveal quantitative structure hidden behind the statement that MPGNNs match WL color refinement.
♻ ☆ FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. It can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. Specifically, we modify and extend verl-tool, resulting in a new framework that we call verl-tool-future. Unlike standard reinforcement learning training frameworks that rely on immediate rewards, verl-tool-future stores prediction-time rollouts, backfills rewards after real-world outcomes become available, and then replays the completed trajectories for policy update. Across three open-source agents, successive FutureWorld training rounds lead to consistent improvements in prediction accuracy, probabilistic scoring, and calibration, demonstrating that delayed real-world outcome feedback can serve as an effective reinforcement learning signal.
comment: We will release the code in the near future
♻ ☆ MAST: A Multi-fidelity Augmented Surrogate model via Spatial Trust-weighting
In engineering design and scientific computing, computational cost and predictive accuracy are intrinsically coupled. High-fidelity simulations provide accurate predictions but at substantial computational costs, while lower-fidelity approximations offer efficiency at the expense of accuracy. Multi-fidelity surrogate modelling addresses this trade-off by combining abundant low-fidelity data with sparse high-fidelity observations. However, existing methods rely on global correlation assumptions that can often fail in practice to capture how fidelity relationships vary across the input space, leading to poor performance, particularly under tight budget constraints. We introduce MAST, a method that blends corrected low-fidelity observations with high-fidelity predictions, trusting high-fidelity near observed samples and relying on corrected low-fidelity elsewhere. MAST achieves this through explicit discrepancy modelling and distance-based weighting with closed-form variance propagation, producing a single heteroscedastic Gaussian process. Across multi-fidelity synthetic benchmarks, MAST shows a marked improvement over the current state-of-the-art techniques. Crucially, MAST maintains robust performance across varying total budget and fidelity gaps, conditions under which competing methods exhibit significant degradation or unstable behaviour. More broadly, MAST provides a spatially adaptive framework for multi-fidelity Gaussian-process modelling, in which the contribution of low-fidelity information is governed by its proximity to high-fidelity calibration data, opening a new direction for more reliable surrogate construction under sparse and budget-constrained settings.
♻ ☆ Mixture of Masters: Sparse Chess Language Models with Player Routing
Modern chess language models are dense transformers trained on millions of games played by thousands of high-rated individuals. However, these monolithic networks tend to collapse into mode-averaged behavior, where stylistic boundaries are blurred, and rare but effective strategies are suppressed. To counteract homogenization, we introduce Mixture-of-Masters (MoM), the first chess mixture-of-experts model with small-sized GPT experts emulating world-class grandmasters. For each move, a post-hoc learnable gating network selects the most appropriate persona to channel depending on the game state, allowing MoM to switch its style dynamically, e.g., Tal's offensive vocation or Petrosian's defensive solidity. When evaluated against Stockfish on unseen standard games, MoM outperforms both dense individual expert networks and popular GPT baselines trained on aggregated data, while ensuring generation variety, control, and interpretability.
♻ ☆ Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems
Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training set. To actively identify the most valuable tasks for communication-structure optimization, we propose an ensemble-based information-theoretic task selection framework. The proposed method estimates task informativeness by how much a candidate task changes the distribution over graph parameters, using ensemble Kalman inversion as an efficient and derivative-free approximation of the corresponding Bayesian update. The resulting estimator is especially suitable for black-box and noisy multi-agent systems. To enhance scalability, we construct a compact candidate pool through embedding-based representative selection and combine the informative selection with surrogate modeling and batch Thompson sampling. We validate our method in both benign settings and settings with agent attacks, demonstrating its effectiveness for communication-structure optimization under constrained computational budgets.
♻ ☆ Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
comment: 9 Pages , 6 figures, Neurips 2026
♻ ☆ EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning
Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, however, produce uncalibrated point estimates without quantifying predictive uncertainty, exposing decision-making to the risk of overconfident, untrustworthy estimates. To establish a reliable and trustworthy estimation paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. To ensure the integrity of the extracted behavioral evidence and prevent artificial confidence inflation during multimodal fusion, EviDep introduces two tailored mechanisms. First, addressing the temporal-frequency heterogeneity of behavioral cues, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts, effectively filtering out task-irrelevant artifacts. Second, a Disentangled Evidential Learning strategy enforces explicit decorrelation of features in these purified representations. By separating the cross-modal shared consensus from modality-specific behavioral nuances before Bayesian fusion, this rigorous disentanglement strictly prevents the model from double-counting overlapping information. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, thereby delivering a trustworthy, risk-aware decision-support tool for depression estimation.
♻ ☆ Pretraining a Foundation Model for Small-Molecule Natural Products
Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Nowadays, existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such one-model-for-a-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well-suited for the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy that is especially tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutional information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification with synthesized molecule-focused baselines to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, by diving into a fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Eventually, our method is experimented with virtual screening, illustrating informative natural product representations that can lead to more effective identification of potential drug candidates.
comment: Accepted by Nature Machine Intelligence(2026)
♻ ☆ SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints
In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that SB-TRPO consistently achieves the best balance of safety and task performance in the hard-constrained regime.
♻ ☆ Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
♻ ☆ A Hybrid Graph Neural Network for Enhanced EEG-Based Depression Detection
Graph neural networks (GNNs) are becoming increasingly popular for EEG-based depression detection. However, previous GNN-based methods fail to sufficiently consider the characteristics of depression, thus limiting their performance. Firstly, studies in neuroscience indicate that depression patients exhibit both common and individualized brain abnormal patterns. Previous GNN-based approaches typically focus either on fixed graph connections to capture common abnormal brain patterns or on adaptive connections to capture individualized patterns, which is inadequate for depression detection. Secondly, brain network exhibits a hierarchical structure, which includes the arrangement from channel-level graph to region-level graph. This hierarchical structure varies among individuals and contains significant information relevant to detecting depression. Nonetheless, previous GNN-based methods overlook these individualized hierarchical information. To address these issues, we propose a Hybrid GNN (HGNN) that merges a Common Graph Neural Network (CGNN) branch utilizing fixed connection and an Individualized Graph Neural Network (IGNN) branch employing adaptive connections. The two branches capture common and individualized depression patterns respectively, complementing each other. Furthermore, we enhance the IGNN branch with a Graph Pooling and Unpooling Module (GPUM) to extract individualized hierarchical information. Extensive experiments on two public datasets show that our model achieves state-of-the-art performance.
♻ ☆ Zero-Shot Quantization via Weight-Space Arithmetic
We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
♻ ☆ Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for single-head attention, the multi-head setting remains less understood due to geometric interference between heads, which invalidates standard monotonicity arguments. In this work, we develop a theoretical framework for multi-head self-attention dynamics and resolve several open questions. We show that, under suitable conditions on the score matrices, a natural multi-head energy functional is non-decreasing along both flat and spherical dynamics. We identify the key obstruction to per-head monotonicity as radial shadow terms, which are projections of each head's output onto token directions, persisting even under orthogonality assumptions. We introduce a sufficient condition ensuring monotonicity and establish robustness to approximate orthogonality. In a simplified scalar-head regime with equiangular token configurations, we derive a closed-form expression for the critical inverse temperature governing clustering behavior, and show that heterogeneous heads exhibit super-additive clustering rates. In this regime, we also prove a separation in clustering time between ReLU and softmax attention in the linearized dynamics. Finally, we establish an entropy production identity and show that attention entropy increases monotonically toward equilibrium as clustering progresses. Our results provide a unified perspective on the dynamics of multi-head attention and clarify the mechanisms underlying clustering and stability in transformer models.
comment: 20 pages, 5 figures
♻ ☆ Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.
♻ ☆ Data Augmentation of Contrastive Learning is Estimating Positive-incentive Noise ICML 2026
Inspired by the idea of Positive-incentive Noise (Pi-Noise or $π$-Noise) that aims at learning the reliable noise beneficial to tasks, we scientifically investigate the connection between contrastive learning and $π$-noise in this paper. By converting the contrastive loss to an auxiliary Gaussian distribution to quantitatively measure the difficulty of the specific contrastive model under the information theory framework, we properly define the task entropy, the core concept of $π$-noise, of contrastive learning. It is further proved that the predefined data augmentation in the standard contrastive learning paradigm can be regarded as a kind of point estimation of $π$-noise. Inspired by the theoretical study, a framework that develops a $π$-noise generator to learn the beneficial noise (instead of estimation) as data augmentations for contrast is proposed. The designed framework can be applied to diverse types of data and is also completely compatible with the existing contrastive models. From the visualization, we surprisingly find that the proposed method successfully learns effective augmentations. Our code is available at https://github.com/hyzhang98/PiNDA.
comment: Accepted by ICML 2026
♻ ☆ FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization
\FANOS{} is a PyTorch optimizer that augments RMS-preconditioned momentum with a scalar feedback controller over update energy. The public reference implementation stores momentum in parameter-update units, applies a non-negative thermostat damping coefficient, supports diagonal, factored, and raw-gradient preconditioning, and exposes diagnostics intended for stability audits. This study gives a complete mathematical specification of the released optimizer, including the exact parameter-unit update, the study-equation physical update mode, bounded log-ratio thermostat control, adaptive preconditioner softening, warmup guardrails, and the experimental \Fast{} profile. We report the v0.2 evidence: five-seed reduced-sample MNIST, Fashion-MNIST, and CIFAR-10 experiments show mean top-1 gains of 0.889, 2.197, and 2.666 percentage points over AdamW for \Fast{}, but with 49.8\%, 61.6\%, and 56.8\% higher wall-clock time. Preliminary scientific, PINN, and EEG smoke tests are mixed and are treated as hypothesis-generating only. The evidence supports \FANOS{} as an alpha-stage research optimizer with a reproducible lightweight-vision signal and an explicit runtime bottleneck.
comment: 17 pages, 3 figures, 5 tables
♻ ☆ Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware Pretraining
Neural recordings exhibit a distinctive form of heterogeneity rooted in differences in cell types, intrinsic circuit dynamics, and stochastic stimulus-response variability that goes beyond ordinary dataset variability, mixing statistically regular neurons with highly stochastic, stimulus-contingent ones within the same dataset. This heterogeneity poses a challenge for self-supervised learning (SSL) -- learnable statistical regularity -- thereby destabilizing representation learning and limiting reliable scaling. We introduce POYO-CAP (Cell-pattern Aware Pretraining), a biologically grounded hybrid pretraining strategy that first trains with masked reconstruction plus lightweight auxiliary supervision on statistically regular neurons -- identified via skewness and kurtosis -- and then fine-tunes on more stochastic populations. On the Allen Brain Observatory dataset, this curriculum yields 12--13\% relative improvements over from-scratch training and enables smooth, monotonic scaling with model size, whereas baselines trained on mixed populations plateau or destabilize. By making statistical predictability an explicit data-selection criterion, POYO-CAP turns neural heterogeneity into a scalable learning advantage for robust neural decoding.
♻ ☆ R-GTD: A Geometric Analysis of Gradient Temporal-Difference Learning in Singular Regimes
Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and leads to instability or degraded performance. While some prior works have applied regularization to relax the nonsingularity assumption, their theoretical guarantees inevitably rely on other restrictive conditions. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error minimization. This formulation naturally yields a regularized GTD algorithms, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We conduct a geometric analysis to establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
comment: 32 pages, 8 figures
♻ ☆ SLOPE: Optimistic Potential Landscape Shaping for Model-based Reinforcement Learning
Model-based reinforcement learning (MBRL) is sample-efficient but struggles in sparse reward settings. A critical bottleneck arises from the lack of informative gradients in sparse settings, where standard reward models often yield flat landscapes that struggle to guide planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting sparse scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks and real-world robotic deployments, demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense rewards.
comment: Work in progress
♻ ☆ Manifold Random Features
We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we re-discover recently introduced mechanism of Gaussian kernel approximation applied in particular to improve linear-attention Transformers, considering simple random walks on graphs and by-passing original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify in thorough experimental studies.
♻ ☆ Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generator mismatch typical of RAG pipelines. This design also amortizes context encoding by reusing precomputed encoder states across queries. On question-answering benchmarks, INTRA outperforms strong engineered retrieval pipelines on both evidence recall and end-to-end answer quality. Our results demonstrate that attention-based models already possess a retrieval mechanism that can be elicited, rather than added as an external module.
Multimedia 8
☆ Anisotropic Modality Align
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
☆ A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation ICMR 2026
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility trade-off difficult to control and interpret. In this work, we approach chord generation from a system-level perspective, introducing a Retrieval-Edit-Rerank (RER) framework that decomposes the task into three explicit stages: i) retrieval, which defines a stylistically plausible candidate space; ii) editing, which enforces music-theoretic feasibility through minimal modifications; and iii) reranking, which resolves soft preferences among feasible candidates. This separation provides a controllable pipeline, where each component addresses a distinct aspect of the generation process, thereby enhancing both the interpretability and adjustability of the output chords. Through objective metrics and subjective evaluation, our decomposed system outperforms all end-to-end chord generation baselines in balancing chord diversity and music-theoretic feasibility. Ablation studies further confirm the complementary roles of each stage in creative exploration and constraint satisfaction.
comment: Accepted by the 2026 ACM International Conference on Multimedia Retrieval (ICMR 2026)
☆ Forensic analysis of video data deletion and recovery in Honeywell surveillance file system
Real-time video surveillance systems store recorded video using digital video recorders (DVRs) and network video recorders (NVRs). To support continuous high-volume video storage, these devices employ specialized, nonstandard file systems that are often proprietary and undocumented. This lack of documentation significantly increases the time and effort required for forensic analysis. In this study, we analyze an undocumented proprietary file system used by Honeywell video surveillance devices-one that, to the best of our knowledge, has not been examined in prior work-and investigate its deletion mechanisms and demonstrate the feasibility of video recovery after deletion. We perform a file system analysis using a binary diffing technique and evaluate three deletion methods supported by the target device: 1) formatting-based deletion, 2) data expiration, and 3) overwrite. For each method, we investigate changes in file system metadata and on-disk data structures and demonstrate the feasibility of video data recovery. Our findings aim to support more efficient and accurate forensic investigations of Honeywell surveillance products and provide foundational insights into the analysis of proprietary file systems used in video recording devices.
comment: The paper has been accepted by The 26th Annual Digital Forensics Research Conference USA (DFRWS USA 2026)
☆ PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
Co-speech gesture generation aims to synthesize realistic body movements that are semantically coherent with speech and faithful to a user-specified gestural style. Existing VQ-VAE based co-speech gesture generation methods improve generation quality but fail to encode semantic structure into the motion representation or explicitly disentangle content from style, limiting both semantic coherence and personalization fidelity. We present PersonaGest, a two-stage framework addressing both limitations. In the first stage, a semantic-guided RVQ-VAE disentangles motion content and gestural style within the residual quantization structure, where a Semantic-Aware Motion Codebook (SMoC) organizes the content codebook by gesture semantics and contrastive learning further enforces content-style separation. In the second stage, a Masked Generative Transformer generates content tokens via a semantic-aware re-masking strategy, followed by a cascade of Style Residual Transformers conditioned on a reference motion prompt for style control. Extensive experiments demonstrate state-of-the-art performance on objective metrics and perceptual user studies, with strong style consistency to the reference prompt. Our project page with demo videos is available at https://danny-nus.github.io/PersonaGest/
comment: 26 pages, 10 figures, 12 tables
☆ Do Joint Audio-Video Generation Models Understand Physics?
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.
comment: Preprint. Full abstract appears in the PDF
☆ MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows. MMTB media and metadata are released at https://huggingface.co/datasets/mm-tbench/mmtb-media
♻ ☆ Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models~(MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
♻ ☆ VDCook:DIY video data cook your MLLMs
We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data `cooking' and indexing\cite{vlogger}. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. \textbf{Project demo:} https://screenapp.io/app/v/WP0SvffgsH
Computer Vision and Pattern Recognition 242
☆ ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation SIGGRAPH 2026
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
comment: SIGGRAPH 2026
☆ BAMI: Training-Free Bias Mitigation in GUI Grounding CVPR 2026
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed \textbf{Masked Prediction Distribution (MPD)} attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce \textbf{Bias-Aware Manipulation Inference (BAMI)}, which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9\% to 57.8\%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
comment: Accepted by CVPR 2026
☆ Relit-LiVE: Relight Video by Jointly Learning Environment Video SIGGRAPH 2026
Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while easing the requirement of known per-frame camera pose. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The Project is available at https://github.com/zhuxing0/Relit-LiVE.
comment: Accepted at SIGGRAPH 2026. Project site: https://github.com/zhuxing0/Relit-LiVE
☆ Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7, 402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
comment: Code: https://github.com/lihongzhao99/MMDG_Benchmark
☆ GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation
Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.
☆ DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification
Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.
☆ SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://anonymous.4open.science/r/SoftSAE-8F71/.
☆ DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study -- order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation -- fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
comment: 18 pages, 7 figures, 9 tables. Code will be made publicly available upon acceptance
☆ Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation CVPR 2026
Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
comment: Accepted to CVPR 2026
☆ Continuous Latent Diffusion Language Model
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
comment: 99 pages, 31 figures, 9 tables. Project page: https://hongcanguo.github.io/Cola-DLM/
☆ MedHorizon: Towards Long-context Medical Video Understanding in the Wild
Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questionsthat jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames, evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy, and generic sampling methods only partially balances local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.
☆ Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
comment: Tech Report. Project Page: https://showlab.github.io/Sparkle/
☆ Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our position is that OOD for foundation models is a structurally distinct problem that cannot be solved within the prevailing model-centric paradigm, and that agentic systems constitute the missing paradigm required to address it. We defend this claim through four steps. First, we give a stage-aware formalization of OOD that accommodates partially observed multi-stage training distributions. Second, we prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance $\varepsilon$, for reasons intrinsic to parameter-based representation. Third, we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling. Fourth, we respond to seven counterarguments, conceding two, and outline a research agenda. We do not claim that agentic methods subsume model-centric ones; we argue that the two are complementary, and that progress on FM-OOD requires explicit recognition of the agentic paradigm as a first-class research direction.
comment: 13 pages, 2 figures
☆ DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
comment: 40 pages, 33 figures
☆ FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.
☆ MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitates heavy manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually-tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97X the training speed of baseline training.
comment: Homepage and code repo: https://aim-uofa.github.io/MARBLE
☆ 3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientations, and scales, a 3D volume can be converted into dense video-action sequences whose controls are the action trajectories. We study this formulation with an action-conditioned pretraining objective, where a tokenizer encodes slice observations and a latent dynamics model predicts the evolution of latent features. Across representative anatomical and spatial downstream tasks, the proposed pretraining is evaluated against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. These results suggest that controllable MRI slice navigation provides a useful complementary pretraining interface for learning anatomical and spatial representations from large unlabeled MRI collections.
comment: 9 pages, 5 figures
☆ GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
☆ Hyperbolic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.
comment: 24 pages, 14 figures
☆ From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles
As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to assess driver attention, cognitive state, and readiness to take control. While technologically promising, their deployment introduces a complex set of ethical and legal challenges - ranging from privacy and consent to data ownership and algorithmic fairness. While overarching frameworks such as the GDPR, EU AI Act, and IEEE standards offer important guidance, they lack the specificity required for addressing the unique risks posed by in-cabin sensing technologies. This paper adopts a review-to-design perspective, critically examining existing regulatory instruments and ethical frameworks -- such as the GDPR, the EU AI Act, and IEEE guidelines -- and identifying gaps in their applicability to the distinctive risks posed by multimodal, AI-enabled in-cabin monitoring. Building on this review, we propose a modular ethical design framework tailored specifically to Driver Monitoring Systems. The framework translates high-level principles into actionable design and deployment guidance, including user-configurable consent mechanisms, fairness-aware model development, transparency and explainability tools, and safeguards for driver emotional well-being. Finally, the paper outlines a risk analysis and failure mitigation strategy, emphasizing proactive incident response and accountability mechanisms tailored to the DMS context. Together, these contributions aim to inform the development of transparent, trustworthy, and human-centered driver monitoring systems for next-generation autonomous vehicles.
☆ FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at $256\times256$ and 2.38 FID at $512\times512$, with particularly strong behavior in the low-NFE regime.
☆ E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
comment: 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent space as stronger foundation for policy-relevant robotics diffusion world models.
comment: 9 pages
☆ Empirical Evidence for Simply Connected Decision Regions in Image Classifiers
Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops inside a decision region can be contracted without leaving that region. To this end, we propose an iterative quad-mesh filling procedure that constructs a finite-resolution label-preserving surface bounded by a given loop and lying entirely within the same decision region. We further connect this construction to natural Coons patches in order to quantify its deviation from a canonical geometric interpolation of the loop. By evaluating our method across several modern image-classification models, we provide empirical evidence supporting the hypothesis that decision regions in deep neural networks are not only path connected, but also simply connected.
☆ Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation and mode-seeking nature of the reverse KL divergence tends to exhibit visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules -- such as GANs or reward models -- to restore visual fidelity. In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives. Code is available at https://github.com/byliutao/cdm.
comment: 22pages, 9 figures
☆ eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a classifier's latent representations during training. eX2L achieves this by penalizing the similarity between Grad-CAM activation maps generated by a primary label classifier and those from a concurrently trained confounder classifier. On the rigorous Spawrious Many-to-Many Hard Challenge benchmark, eX2L achieves an average accuracy (AA) of 82.24% +/- 3.87% and a worst-group accuracy (WGA) of 66.31% +/- 8.73%, outperforming the current state-of-the-art (SOTA) by 5.49% and 10.90%, respectively. Beyond its competitive performance, eX2L demonstrates that functional domain invariance can be achieved by explicitly decoupling label and nuisance attributes at the group level.
☆ The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study
Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p less than 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient (random/temporal/scene) shows the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition I = R composed with S + N with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows r = 0.67 cross-correlation with non-Lambertian residual error, more than 4 times the texture channel's correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 plus or minus 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, with the unique capability of source-separated uncertainty.
comment: Submitted to Journal of Electronic Imaging. 25 pages, 10 figures. Addresses evaluation protocol issues in intrinsic image decomposition and proposes source-separable uncertainty estimation
☆ Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations
This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectories practical by trading additional recomputation for substantially lower memory usage. This enables full-gradient adaptive attacks against diffusion- and Langevin-based purification defenses, where prior evaluations often resort to approximate backpropagation due to memory constraints. These approximations can weaken the attack signal and risk overestimating robustness. In parallel, stochasticity in iterative purification is frequently under-controlled, even though different purification trajectories can substantially change reported robustness metrics. Building on this insight, we introduce a memory-efficient full-gradient evaluation framework for stochastic purification defenses. The framework combines checkpointed backpropagation with evaluation protocols that control stochastic variability, thereby reducing memory bottlenecks while preserving exact gradients. We evaluate diffusion-based purification and Langevin sampling with Energy-Based Models (EBMs), demonstrating that full-gradient attacks uncover vulnerabilities missed by approximate-gradient evaluations. Our framework yields stronger state-of-the-art $\ell_{\infty}$ and $\ell_{2}$ white-box attacks and further supports probing out-of-distribution robustness. Overall, our results show that exact-gradient evaluation is essential for reliable benchmarking of iterative stochastic defenses.
☆ SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation NeurIPS 2026
High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
comment: 27 pages, 17 figures. Submitted to NeurIPS 2026
☆ Earth-o1: A Grid-free Observation-native Atmospheric World Model
Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth-o1, an observation-native atmospheric world model that overcomes these structural limitations. Rather than relying on conventional atmospheric dynamical modeling systems or traditional data assimilation, Earth-o1 directly learns the continuous, three-dimensional physical evolution of the Earth system from ungridded observational data. By integrating diverse sensor inputs into a unified, grid-free dynamical field, the model autonomously advances the atmospheric state in space and time. We show that this fundamentally distinct paradigm enables direct, real-time forecasting and cross-sensor inference without the overhead of explicit numerical solvers. In hindcast evaluations, Earth-o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS). These results establish that continuous, observation-driven world models -- a new class of fully observation-native geophysical simulators -- can match the fidelity of established physical frameworks, providing a scalable data-driven foundation for a digital twin of the Earth.
☆ TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non-iterative estimators via projection, for the classification. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi-DMR. All data and codes are available here: https://github.com/shouvik-sardar/TinyBayes
comment: 14 Pages, 1 Figure, 4 Tables
☆ NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
comment: 10 pages, 7 figures
☆ Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
comment: 35 pages, 30 figures, 8 tables
☆ Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency
Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
comment: Work in progress
☆ When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy
Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture--dataset pairs, with a mean gain of 4.66\%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18\% over the next best baseline.
☆ On-Orbit Real-Time Wildfire Detection Under On-Board Constraints
We present a deployed system for on-orbit wildfire detection aboard a nine-satellite commercial thermal infrared constellation, operating under demanding joint constraints: sub-megabyte model footprint, sub-150 ms per-batch TensorRT FP16 inference on an NVIDIA Jetson Xavier NX, and an end-to-end alert pipeline targeting under 10 minutes from satellite overpass to fire event communication. The system operates on uncalibrated mid-wave infrared (MWIR) single-band imagery at 200 m ground sampling distance, where fires frequently appear as sub-pixel or single-pixel thermal anomalies under extreme class imbalance -- challenges not addressed by the contextual thermal-thresholding pipelines (MODIS, VIIRS) that currently dominate operational fire monitoring. We present an empirical study of lightweight dense representation learning for this regime using a proprietary nine-satellite MWIR dataset. We compare dense masked autoencoding (DenseMAE) and a hybrid DenseMAE+EMA (exponential moving average) distillation variant, and evaluate representations via linear probing and full-distribution pixel-level average precision (AP) under extreme class imbalance. DenseMAE pretraining enables compact downstream models on the latency-accuracy Pareto frontier: our fastest SSL-pretrained model achieves 0.640 test AP and 0.69 event-level Fire-F1 with 65.34 ms latency per batch and a 0.52 MB engine, without pruning or compression. The best configuration reaches 0.699 AP and 0.744 Fire-F1 below 1 MB, outperforming a supervised baseline (0.650 AP) under comparable constraints.
☆ Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
☆ ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision
Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute losses on annotated areas and generate pseudo labels by propagating annotations to adjacent regions. However, these methods often suffer from inaccurate and unrealistic segmentations due to insufficient supervision and incomplete shape information. In contrast, we first investigate the principle of good scribble annotations, which leads to efficient scribble forms via supervision maximization and randomness simulation. We further introduce regularization terms to encode the spatial relationship and the shape constraints, where the EM algorithm is utilized to estimate the mixture ratios of label classes. These ratios are critical in identifying the unlabeled pixels for each class and correcting erroneous predictions, thus the accurate estimation lays the foundation for the incorporation of spatial prior. Finally, we integrate the efficient scribble supervision with the prior into a framework, referred to as ZScribbleSeg, and apply it to multiple scenarios. Leveraging only scribble annotations, ZScribbleSeg achieves competitive performance on six segmentation tasks including ACDC, MSCMRseg, BTCV, MyoPS, Decathlon-BrainTumor and Decathlon-Prostate. Our code will be released via https://github.com/DLwbm123/ZScribbleSeg.
comment: Accepted by Medical Image Analysis
☆ Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search
Video semantic search in densely crowded scenes remains a challenging task due to visual encoders tendency to prioritize salient foreground regions while neglecting contextually important, background areas. We propose an Inverse Attention Embedding mechanism that explicitly captures and highlights these overlooked regions. By combining inverse attention embeddings with traditional visual embeddings, our method significantly enhances semantic retrieval performance without additional training. Initial experiments and ablation studies demonstrate promising improvements over existing approaches in recall for video semantic search in crowded environments.
☆ Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance CVPR 2026
We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumination is cast, corresponding image measurements are used to update the uncertainty at each pixel. Finally, a fine-tuning-based approach reconstructs the depth map and reflectance parameter maps, by minimizing the differences between all physical measurements and their simulated counterparts. The effectiveness of our framework is demonstrated on physical objects with wide variations in shape and appearance. Our depth results compare favorably with state-of-the-art techniques, while our reflectance results are comparable when validated against photographs.
comment: Accepted to CVPR 2026. 10 pages, 13 figures
☆ Playing the network backward: A Game Theoretic Attribution Framework
Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.
☆ Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
☆ Bridging visual saliency and large language models for explainable deep learning in medical imaging
The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
☆ EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
comment: Preprint. 22 pages, 10 figures
☆ Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
☆ SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision
Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.
☆ Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation MICCAI 2026
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
comment: 10 pages, 5 figures. Submitted to MICCAI 2026
☆ DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the continuous generation of fresh prompts via task-specific spaces and difficulty-aware sampling. DynT2I-Eval evaluates model performance across text alignment, perceptual quality, and aesthetics. Heterogeneous outputs are unified into prompt-conditioned pairwise comparisons, allowing a dynamic scheduler, micro-batch aggregation, and weighted Bayesian updates to maintain a stable online leaderboard despite changing prompt distributions and model injection. Experiments with independently sampled prompt streams demonstrate that continually refreshed prompts provide a robust evaluation protocol, reducing the impact of prompt-set-specific tuning. Simulations and ablations further confirm that the proposed ranking framework achieves a strong balance among cold-start convergence, late-entry discovery, and long-run ranking fidelity.
☆ Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study
Continual learning (CL) is essential for deploying medical image segmentation models in clinical environments where imaging domains, anatomical targets, and diagnostic tasks evolve over time. However, continual segmentation still faces three main challenges. First, the scenarios for this task remain insufficiently standardized for real-world clinical settings. Second, existing research has been primarily focused on mitigating forgetting, overlooking the other essential properties such as plasticity. Third, a benchmark work with comprehensive evaluation on existing methods is stll desirable. To address these gaps, we present such benchmark study of continual medical image segmentation. We first define three clinically motivated scenarios, namely Domain-CL, Class-CL, and Organ-CL, to respectively capture the cross-center domain shift, the incremental anatomical structure segmentation, and the cross-organ segmentation. We then introduce an evaluation framework that measures not only general performance and forgetting, but also plasticity, forward generalizability, parameter efficiency, and replay burden. The results, from extensive experiments with representative CL methods, showed that it was still challenging to develop a model that could satisfy all the requirements simultaneously. Nevertheless, these studies also suggested that the replay-based methods achieve the best overall balance between stability and plasticity, the parameter-isolation methods should be effective at reducing forgetting, though at the cost of increased model size, and the forward generalizability remain a significantly understudied aspect of this research field. Finally, we discuss related learning paradigms and outline future directions for continual medical image segmentation.
comment: Submitted to a journal
☆ Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles
The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them heavily reliant on the specific model architectures used for generation and inversion. This prevents any clear conclusion on the performance of any method, especially regarding security, for which a rigorous definition is lacking. Against this approach, we argue that the effectiveness of a watermarking scheme should be established purely through a thorough theoretical analysis. This is enabled by decoupling the model-dependent part from the actual decision mechanism of the watermarking system. Using this decoupling, we introduce a formal evaluation framework based on security, robustness, and fidelity. This allows precise comparisons between watermarking systems through a characteristic surface representing the trade-off between these three quantities, independent of any generative model. Based on this framework, we propose SSB, a novel watermarking method that generalizes previous seed-based methods by allowing to reach any security-robustness-fidelity regime on its characteristic surface. This work opens the door to the design of modern watermarking systems with theoretical guarantees that do not necessitate any costly empirical evaluations.
☆ Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow
Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.
☆ AI-Generated Images: What Humans and Machines See When They Look at the Same Image
The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understandable explanations for their predictions. To this end, we develop a suite of detectors with various architectures and fine-tuning strategies, trained on our large-scale photorealistic fake image dataset, AIText2Image, and assess their performance on state-of-the-art text-to-image AI generators. We integrate 16 different explainable AI (XAI) methods into our detection framework, and the visual explanations are comprehensively refined and evaluated through a novel approach that prioritizes human understanding of AI-generated images, using both textual and visual responses collected from a survey of 100 participants. This framework offers insights into visual-language cues in fake image detection and into the clarity of XAI methods from a human perspective, measuring the alignment of XAI outputs with human preferences.
comment: Included in the main conference proceedings published by Springer Nature (CCIS Series)
☆ Autoregressive Visual Generation Needs a Prologue
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
☆ Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typically modulate a shared backbone with global prompts or degradation descriptors, or route features through predefined expert pools. However, compact global conditioning can bottleneck localized degradation evidence, while static expert routing may produce homogeneous updates or rely on unstable sparse assignments. We propose \textbf{Continuous Expert Assembly} (CEA), a token-wise dynamic parameterization framework for all-in-one image restoration. CEA employs a lightweight \textbf{Cross-Attention Hyper-Adapter} to probe intermediate spatial features and synthesize instance-conditioned low-rank routing bases and residual directions. Each spatial token then assembles its own residual update via dense signed dot-product affinities over the generated rank-wise components, avoiding external prompts, static expert banks, and discrete Top- selection. The resulting assembly rule also admits a linear-attention perspective, making its dense token-wise routing behavior transparent. Experiments on AIO-3, AIO-5, and CDD-11 show that CEA improves average restoration quality over strong prompt-, descriptor-, and expert-based baselines, with the clearest gains on spatially varying and compositional degradations, while maintaining favorable parameter, FLOP, and runtime efficiency.
☆ Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain's unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
comment: 10 pages, 5 figures
☆ Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event-AHU/OpenEvTracking.
☆ Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.
☆ Metonymy in vision models undermines attention-based interpretability
Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information about the corresponding image region. In this work, we test this basic assumption, measuring intra-object leakage in vision models using part-based attribute annotations. Through a comprehensive experimental evaluation, we show that modern pretrained vision transformers violate the locality assumption and exhibit a strong intra-object leakage, in which each part encodes information from the whole object, a visual metonymy that compromises the faithfulness of attention-based interpretable-by-design methods for part-based reasoning, ultimately rendering them uninterpretable. In addition, we establish an upper bound using a two-stage approach that prevents leakage by design. We then show that this inherently disentangled feature extraction improves attribute-driven part discovery on a variety of tasks, confirming the practical impact of intra-object leakage. Our results uncover a neglected issue affecting the interpretability of part-based representations, such as those in CBMs relying on part-centric concepts, highlighting that two-stage approaches offer a promising way to mitigate it.
comment: 28 pages, preprint
☆ VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token level supervision. To stably integrate dense supervision with RL, we introduce a direction magnitude decoupling mechanism, where rollout level advantages computed from rewards determine update direction, while structured privileged signals modulate token level update magnitudes. This design enables semantically aligned and fine grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs.
☆ Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning CVPR 2026
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \textbf{\tracker}, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adherent to the easy-to-hard learning principle, our contextual association mechanism operates based on two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb feature, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.
comment: CVPR 2026
☆ OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
☆ LARGO: Low-Rank Hypernetwork for Handling Missing Modalities
Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing methods tackle this problem in feature space by engineering representations that are robust to missing inputs, we instead operate in weight space. We propose LARGO, a hypernetwork that compresses the $2^N-1$ dedicated missing-modality models into a single network by modelling the convolutional weights using the Canonical Polyadic (CP) tensor decomposition. Extensive experimental validation on BraTS 2018 (4 modalities, 15 scenarios) and ISLES 2022 (3 modalities, 7 scenarios) shows that our method ranks first in 47 out of 52 configurations, achieving average Dice improvements of +0.68$\%$ and +2.53$\%$ over state-of-the-art baselines (mmFormer, M$^{3}$AE, ShaSpec, SimMLM). A proof-of-concept experiment on avMNIST suggests that LARGO may extend beyond medical imaging to heterogeneous non-medical modalities.
☆ AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes IEEE
In multimedia application scenarios, images captured under low-illumination conditions often lead to lower accuracy in visual perception tasks compared to those taken in well-lit environments. To tackle this challenge, we propose AMIEOD, an image enhancement-enabled object detection framework for low-illumination scenes, where the two tasks are jointly optimized in a detection performance-oriented manner. Specifically, to fully exploit the information in poorly lit images, a Multi-Experts Image Enhancement Module (MEIEM) is proposed, which leverages diverse enhancement strategies. On this basis, aiming to better align the MEIEM with the detection task, we propose a Detection-Guided Regression Loss (DGRL) that utilizes the detection result to decide the regression target. Moreover, to dynamically select the most suitable enhancement strategy from MEIEM during inference, we construct an Expert Selection Module (ESM) guided by the proposed Detection-Guided Cross-Entropy (DGCE) loss, which formulates the optimization of ESM as a classification task. The improved method is well-matched with current detection algorithms to improve their performance in dim scenes. Extensive experiments on multiple datasets demonstrate that the proposed method significantly improves object detection accuracy in low-illumination conditions. Our code has been released at https://github.com/scujayfantasy/AMIEOD
comment: Accepted at IEEE Transactions on Multimedia
☆ Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval ICML 2026
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
comment: Accepted by ICML 2026. 16 pages, 6 figures, 3 tables
☆ MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring problem. Semantic discrepancies are quantified via a weighted bi-directional KL divergence and combined with global similarity in a multi-scale framework for both single- and multi-candidate evaluations. Extensive experiments show that MSD-Score achieves state-of-the-art correlation with human judgments among reference-free metrics. Beyond accuracy, its probabilistic formulation yields transparent and decomposable diagnostics of local grounding errors, providing a deterministic complementary signal to holistic similarity metrics and judge-based evaluators.
comment: Preprint. 17 pages, 10 figures. Code is available at: https://steinsgatesg.github.io/MSDScore/
☆ Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for a image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
☆ PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers
We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. PersonaGesture consists of two key components, Adaptive Style Infusion (ASI) and Implicit Distribution Rectification (IDR), to separate temporal identity evidence from residual statistic correction. A Style Perceiver first encodes the variable-length reference into compact speaker-memory tokens. ASI injects these tokens into denoising through zero-initialized residual cross-attention, enabling style evidence to affect motion formation without replacing the pretrained speech-to-motion prior. Building on this, IDR applies a length-aware diagonal affine map in latent space to correct residual channel-wise moments estimated from the same reference. Across BEAT2 and ZeroEGGS, we evaluate quantitative metrics, reference-identity controls, same-audio diagnostics, qualitative comparisons, and human preference. Experiments show that separating denoising-time speaker memory from conservative post-generation moment correction improves unseen-speaker personalization over collapsed style codes, full-reference attention, and one-clip finetuning. Project: https://xiangyue-zhang.github.io/PersonaGesture.
☆ Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
☆ RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce \texttt{RealCam}, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a \textbf{Cross-frame In-context Learning} paradigm. By interleaving source and target frames into synchronized contextual pairs, our design inherently enables length-agnostic generalization and naturally facilitates causal adaptation, breaking the rigid prefix bottleneck. We then distill this teacher into a few-step causal student via Self-Forcing with Distribution Matching Distillation, enabling efficient, on-the-fly streaming synthesis. Furthermore, to mitigate severe loop inconsistency in closed-loop trajectories, we propose \textbf{Loop-Closed Data Augmentation (LoopAug)}, a novel paradigm that synthesizes globally consistent loop sequences from existing multiview datasets. Extensive experiments demonstrate that \texttt{RealCam} achieves state-of-the-art visual fidelity and temporal consistency while enabling truly interactive camera control with orders-of-magnitude faster inference than existing paradigms. Our project page is at https://xyc-fly.github.io/RealCam/.
☆ Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization CVPR 2026
As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
comment: Accepted by CVPR 2026
☆ Domain Generalization through Spatial Relation Induction over Visual Primitives
Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augmentation, or feature-alignment objectives. Although these strategies can be effective, they leave the representation learning of structural composition implicit, which may limit performance on compositional domain generalization benchmarks. In this work, we propose Primitive-Aware Relational Structure for domain gEneralization (PARSE), an image classification framework that factors visual recognition into visual primitives and their relational composition. We represent these compositions using soft binary, ternary, and quaternary predicates over primitive locations, yielding differentiable measures of spatial alignment that can be learned end-to-end. To learn primitives and relational structures jointly, we design an end-to-end architecture with three components: (1) a convolutional neural network (CNN) backbone that extracts general visual features, (2) a concept bottleneck layer that maps these features to primitive heatmaps with differentiable spatial coordinates, and (3) a structural scoring layer that evaluates candidate spatial relations among the detected primitives. We then compute class probability from the joint evidence of its class-specific relational compositions. Across CUB-DG and the DomainBed benchmark suite,PARSE improves accuracy by over 4.5 percentage points on CUB-DG and remains competitive with existing DG methods on DomainBed.
☆ PlotPick: AI-powered batch extraction of numerical data from scientific figures
Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.
comment: 7 pages, 2 figures, 2 tables. Software available at https://plotpick.streamlit.app and https://github.com/tommycarstensen/plotpick
☆ T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval
Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large-scale dataset called T2I-VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine-grained part-level annotations. Experimental results on the T2I-VeRI dataset show that PFCVR achieves 29.2\% Rank-1 accuracy, improving over the best competing method by +3.7\% percentage points. On the newly proposed T2I-VeRW benchmark, PFCVR achieves 55.2\% Rank-1 accuracy, outperforming a comprehensive set of recent state-of-the-art methods. Source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
☆ Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models
Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
☆ Neuromorphic visual attention for Sign-language recognition on SpiNNaker
Sign-language recognition has achieved substantial gains in classification accuracy in recent years; however, the latency and power requirements of most existing methods limit their suitability for real-time deployment. Neuromorphic sensing and processing offer an alternative paradigm based on sparse, event-driven computation that supports low-latency and energy-efficient perception. In this work, we introduce an end-to-end neuromorphic architecture for American Sign Language (ASL) fingerspelling recognition that integrates a spiking visual attention mechanism for online region-of-interest extraction with a compact spiking neural network deployed on the SpiNNaker neuromorphic platform. We benchmark the proposed system against two datasets: a synthetically generated event-based version of the Sign Language MNIST dataset and a natively recorded ASL-DVS dataset, whilst providing a comprehensive overview of Sign-language recognition and related work. This work yields competitive performance in simulation (92.27%) and comparable performance on neuromorphic hardware deployment (83.1%), while achieving the most energy-efficient architecture (0.565 mW) and low latency (3 ms) across all benchmarked approaches. Despite its compact design, the system demonstrates the suitability of task-dependent visual attention applications for edge deployment.
☆ 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
comment: 21 pages, 16 figures
☆ iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring
Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
comment: 21 Pages, 12 figures
Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters ICIP2026
Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66\% over the vanilla SAM2, while reducing computational costs by approximately 87\% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.
comment: Accepted by ICIP2026
☆ TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity ACL 2026
We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.
comment: ACL 2026 Findings
☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
☆ RAWild: Sensor-Agnostic RAW Object Detection via Physics-Guided Curve and Grid Modeling
Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bit depths across devices introduce substantially larger domain gaps than sRGB, making sensor-agnostic generalization a fundamental challenge. In this study, we present \textbf{RAWild}, a physics-guided global-local tone mapping framework for sensor-agnostic RAW object detection. By factoring sensor-induced variations into a global tonal correction and a spatially adaptive local color adjustment, both driven by RAW distribution priors, our framework enables a single network to train jointly across heterogeneous sensors. To further support cross-sensor generalization, we construct a physics-based RAW simulation pipeline that synthesizes realistic sensor outputs spanning diverse spectral sensitivities, illuminants, and sensor non-idealities. Extensive experiments across multiple RAW benchmarks covering bit depths from 10 to 24 demonstrate state-of-the-art (SOTA) performance under single-dataset, mixed-dataset, and challenging robustness settings.
☆ Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering
Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we develop an evidence-grounded, cross-verified large language model (LLM) ensemble to filter pathological findings from radiology reports, enabling the construction of pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs, first, flag structure-level abnormality candidates grounded in verbatim report evidence and, second, resolve disagreements via cross-verification. Using distribution-aware generalized additive models for location, scale, and shape, we establish comprehensive whole-body reference charts for 106 anatomical structures (volumes and attenuation) across adulthood, accounting for age, sex, contrast enhancement, and acquisition parameters. Longitudinal analyses reveal structure- and contrast-dependent changes distinct from cross-sectional trends. These resources facilitate covariate-adjusted centile scoring from routine CT, supporting standardized quantitative phenotyping, multi-site imaging studies, and scalable opportunistic screening research.
comment: Supplement available at: https://github.com/ai-med/body-charts/blob/main/body_charts_supp.pdf
☆ Backdoor Mitigation in Object Detection via Adversarial Fine-Tuning
Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defenses for object detection remain comparatively underdeveloped. Adversarial fine-tuning is a common backdoor mitigation approach in classification, but adapting it to detection is nontrivial as classification-oriented adversarial generation does not match the detection attack space, where attacks may cause object misclassification or disappearance, and standard detection losses can dilute the repair signal across many predictions. We address these challenges through a detection-aware adversarial fine-tuning framework for mitigating object-detection backdoors when the defender has access only to a compromised detector and a small clean dataset, without knowing the attack objective. For adversarial generation that does not require knowledge of the attack objective, we introduce soft-branch minimisation, which uses a soft gate to combine objectives aligned with misclassification and disappearance attacks, together with a detection-aware classification-loss maximisation. For targeted repair, we introduce a dual-objective fine-tuning loss applied to target-matched predictions, concentrating the defensive update on predictions most relevant to the backdoor behaviour. Experiments across CNN- and Transformer-based detectors show that our approach more effectively reduces attack success while preserving true detections, compared with classification-oriented baselines, and maintains competitive clean detection performance.
☆ Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: \textit{Discriminative RMs} regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, \textit{Generative RMs} with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled ``think-then-score'' paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
☆ From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation
High-resolution rainfall observations are crucial for weather forecasting, water management, and hazard mitigation. Traditional operational measurements are often biased and low-resolution, limiting their ability to capture local rainfall. Accurate high-resolution rainfall maps require integrating sparse surface observations, yet existing deep learning densification methods are hindered by rainfall's skewed, localized nature, noise, and limited spatio-temporal fusion. We present DropsToGrid, a Neural Process-based method that generates dense rainfall fields by fusing temporal sequences from noisy, irregularly distributed private weather stations with spatial context from radar. Leveraging multi-scale feature extraction, temporal attention, and multi-modal fusion, the model produces stochastic, continuous rainfall estimates and explicitly quantifies uncertainty. Evaluations on real-world datasets demonstrate that DropsToGrid outperforms both operational and deep learning baselines, generating accurate high-resolution rainfall maps with well-calibrated uncertainty, even when only few stations are available and in cross-regional scenarios.
☆ Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at \href{https://github.com/yjh576/CAKI}{this https URL}.
comment: Accepted by International Journal of Computer Vision
☆ Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers
Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels outperforming the state of the art k-nearest neighbor based identification methods by more than 7% reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time, demonstrating its robustness in realistic high-noise and attack scenarios. We compare its performance against baseline methods.
comment: 10 pages, 3 figures, 4 tables; Supplementary 5 pages with 5 figures; Including references total 18 pages
☆ Understanding Cross-Language Transfer Improvements in Low-Resource HTR: The Role of Sequence Modeling
Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains unclear whether these improvements are better explained by shared visual representations or sequence-level dependencies. In this work, we conduct a controlled architectural study of line-level Arabic-script HTR, comparing CNN-only models with CTC decoding and CRNN models under identical single-script and multi-script training regimes. Experiments are performed on Arabic (KHATT), Urdu (NUST-UHWR), and Persian (PHTD) datasets under low-resource settings (K in {100, 500, 1000}). Our results show a clear divergence in transfer behavior: while CNN-only models exhibit limited or unstable improvements, CRNN models achieve better performance under multi-script training, particularly in the most data-constrained regimes. Focusing on transfer improvements (delta CER) rather than absolute performance, we find that cross-language improvements are associated with sequence-level modeling, while sharing visual representations learned by the CNN encoder, corresponding to similarities in character shapes across scripts, alone appears to be insufficient. This finding suggests that contextual modeling plays an important role in enabling effective transfer in low-resource scenarios, and that similar behavior may extend to other low-resource language settings.
☆ Detecting AI-Generated Videos with Spiking Neural Networks
Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On caption-paired benchmark, GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Networks (SNNs), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14\% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
☆ MTL-MAD: Multi-Task Learners are Effective Medical Anomaly Detectors
Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, we propose to learn multiple self-supervised and pseudo-labeling tasks from scratch, using a joint model based on Mixture-of-Experts (MoE). By carefully integrating multiple proxy tasks, the joint model effectively learns a robust representation of normal anatomical structures, so that anomaly scores can be derived based on how well the multi-task learner (MTL) solves each task during inference. We perform comprehensive experiments on BMAD, a recent benchmark that comprises a broad range of medical image modalities. The empirical results indicate that our multi-task learner is an effective anomaly detector, outperforming all state-of-the-art competitors on BMAD. Moreover, our model produces interpretable anomaly maps, potentially helping physicians in providing more accurate diagnoses.
☆ DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation CVPR 2026
Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
comment: Accepted to CVPR 2026. Includes supplementary material
☆ Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning with part conditioning, progressively bridging global semantics and fine-grained geometry. Therefore, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The codes will be released.
comment: Project page: https://contactprompt-release.github.io/
☆ 3DSS: 3D Surface Splatting for Inverse Rendering
We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a direct formulation in terms of the reconstruction kernels themselves. From this foundation we derive a coverage-based compositing model whose per-layer opacity arises directly from the accumulated Elliptical Weighted Average reconstruction weight, yielding anti-aliased silhouettes and informative visibility gradients at sparsely covered edges. Combined with forward microfacet shading under co-optimized HDR environment lighting and density-aware adaptive refinement, 3DSS jointly recovers shape, spatially-varying BRDF materials, and illumination. Because the optimized representation is a set of oriented surface samples, it bridges natively to mesh-based workflows via surface reconstruction from oriented point cloud methods. We evaluate 3DSS against mesh-based, implicit, and Gaussian-splatting baselines across geometry reconstruction, novel-view synthesis, and novel-illumination relighting.
☆ InkDiffuser: High-Fidelity One-shot Chinese Calligraphy via Differentiable Morphological Optimization
Current Chinese calligraphy generation methods suffer from poor stroke rendering and unrealistic ink morphology, resulting in outputs with limited visual fidelity and artistic fluidity. To address this problem, we propose \textbf{InkDiffuser}, a diffusion-based generative framework for one-shot Chinese calligraphy synthesis. To guarantee high-fidelity rendering, we introduce two core contributions: a high-frequency enhancement mechanism and a Differentiable Ink Structure (DIS) loss that explicitly regularizes ink morphology. Inspired by the observation that high-frequency information in individual samples typically carries contour details, we enhance content extraction by explicitly fusing high-frequency representations for more accurate font structure. Furthermore, we propose a differentiable ink structure loss that integrates differentiable morphological operations into the diffusion process. By allowing the model to learn an explicit decomposition of ink-trace structures, DIS facilitates fine-grained refinement of stroke contours and delivers significantly improved visual realism in the generated calligraphy. Extensive experiments on various calligraphic styles and complex characters demonstrate that InkDiffuser can generate superior calligraphy fonts with realistic ink rendering effects from only a single reference glyph and outperform existing few-shot font generation approaches in structural consistency, detail fidelity, and visual authenticity. The code is available at the following address: https://github.com/JingVIPLab/InkDiffuser.
comment: 10 pages, 6 figures
☆ Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection
Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather than realistic visual semantics and process them with vision encoders pretrained on RGB data, leading to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified two-stage framework that leverages the RGB modality from auxiliary categories as cross-modal guidance for zero-shot 3D anomaly detection. First, we introduce a cross-modal feature alignment paradigm that maps rendering features into the RGB semantic space. Unlike prior works that implicitly rely on pretrained encoders, our method enables direct semantic transfer from RGB observations. A semantic consistency reweighting strategy is further introduced to refine feature alignment by reweighting local regions according to holistic semantic consistency. Second, we propose a modality-aware prompt learning framework with dual-prompt contrastive alignment. By assigning independent prompts to RGB-aligned and rendering features, our method captures complementary semantics across modalities, while the contrastive alignment further enhances prompt representations to improve discriminability. Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD demonstrate that Align3D-AD consistently outperforms existing zero-shot methods under both one-vs-rest and cross-dataset settings, highlighting its generalization capability and robustness. Code and the dataset will be made available once our paper is accepted.
☆ VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
☆ Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media CVPR
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
comment: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
☆ ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction
Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crippling their ability to generalize across arbitrary aesthetics and grid layouts. Furthermore, existing models suffer from two critical failure modes during reconstruction. First, extracting thin, intersecting curves frequently causes structural fragmentation and the erasure of fine visual details, as standard architectures struggle against complex backgrounds. Second, semantic association is notoriously error-prone; current pipelines rely on rigid spatial heuristics that easily break down against the unpredictable legend placements of in-the-wild charts. Finally, measuring true progress is hindered by evaluation protocols that assess isolated sub-tasks rather than holistic, end-to-end data reconstruction. To address these foundational issues, we introduce ChartZero, a parsing framework that leverages synthetic priors to enable robust zero-shot chart data extraction. By training exclusively on a purely synthetic dataset of simple mathematical functions, our model completely bypasses the real-world annotation bottleneck. We overcome curve fragmentation via a novel Global Orthogonal Instance (GOI) loss, and replace brittle spatial rules with an open-vocabulary, Vision-Language Model (VLM)-guided legend matching strategy. Accompanied by a new metric and benchmark specifically designed for full end-to-end reconstruction, our evaluations demonstrate that ChartZero significantly advances generalized plot digitization without requiring real-world supervision. Code and dataset will be released upon acceptance.
☆ CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting "No X" despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.
☆ Na-IRSTD: Enhancing Infrared Small Target Detection via Native-Resolution Feature Selection and Fusion
Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard small targets' details, resulting in suboptimal performance. In this paper, we present Na-IRSTD, a native-resolution feature extraction and fusion framework for IRSTD. This framework elegantly incorporates native-resolution features to preserve subtle target cues, overcoming the resolution limitations of existing infrared approaches and significantly improving the model's ability to localize small targets. We also introduce an effective token reduction and selection strategy, which selects target patches with high accuracy and confidence, boosting the low-level details of the feature while effectively reducing native-resolution patch tokens compared to dense processing, thereby avoiding imposing an unbearable computational burden. Extensive experiments demonstrate the robustness and effectiveness of our token reduction and selection strategy across multiple public datasets. Ultimately, our Na-IRSTD model achieves state-of-the-art performance on four benchmarks.
☆ Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses
Image steganography is widely used to protect user privacy and enable covert communication. However, it can also be abused by the adversary as a covert channel to bypass content moderation, disseminate harmful semantics, and even hide malicious instructions in images to elicit dangerous outputs from large models, posing a practical security risk that continues to evolve. To address the lack of a unified and systematic evaluation framework, we propose SADBench, a systematic benchmark that assesses the adversary's ability to inject harmful secrets via steganography and the defender's ability to detect such threats through steganalysis. Crucially, SADBench comprises $4$ core tasks, namely steganography attack capability evaluation, steganalysis defense capability evaluation, efficiency evaluation, and transferability evaluation. It evaluates both image-payload and text-payload steganography across diverse cover distributions, utilizing harmful visual semantics and toxic instructions to simulate malicious attacks. Across a broad set of attacks and detectors, SADBench reveals that (i) INN and autoencoder-based methods demonstrate superior stability compared to other architectures, (ii) in-domain detection is near-perfect and cheaper than generation, (iii) a critical asymmetry exists in transferability where attacks robustly generalize to new distributions while detectors fail to adapt, and (iv) real-world threats persist on social media, where payloads either survive minimal compression or effectively adapt to aggressive compression via simulated training. Overall, SADBench establishes a systematic, reproducible, and extensible framework to quantify risks, paving the way for measurable and security-driven advancements in steganography defense.
comment: 23 pages
☆ Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.
☆ Von Neumann Networks
In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled by a diffusion process. In this work, we show that such a system, when developed in a modern deep learning setting, enables the construction of an artificial neuron having specialized roles that can be learnt. We refer to this neuron as the Von Neumann neuron, and the resulting neural network from such neurons result in a self-engineered design whose architecture is only dependent on the structure and locations of its inputs and outputs on this cellular array. The mathematical framework for these Von Neumann Networks (VNNs) is also constructed and shows that they are based on the extension of neural operators and the learning of Green's functions with convolutions on a cellular topology having a diffusion signature. We also prove that these VNNs are part of a more general computational system called Cellular Machines that are computationally universal. Initial experiments show that VNN based multi-layered perceptrons outperform their equivalent deep learning variant on basic tasks, while being more parameter efficient and are capable of learning new types of tasks. This includes the ability to solve for and construct an extension of the Von Neumann (hardware) architecture common to all modern computers to cells and suggests new opportunities that could be explored.
☆ The autoPET3 Challenge -- Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization
We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
comment: Preprint submitted to Medical Image Analysis
☆ X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.
comment: 12 pages, 7 figures
☆ iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatialguidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($ρ$ = 0.93, p < 10$^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and hostto-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
comment: 11 pages, 13 figures, 13 tables
☆ MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation
Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment, however, they struggle to maintain precise object contact. We reveal a key finding termed \textit{Geometric Forgetting}: as diffusion model depth increases, semantic feature tend to overshadow object geometry feature, causing the model to lose its perception to object geometry. To address this, we propose MaMi-HOI, a hierarchical framework reconciling \textbf{Ma}cro-level kinematic fluidity with \textbf{Mi}cro-level spatial precision. First, to counteract geometric forgetting, we introduce the Geometry-Aware Proximity Adapter (GAPA), which explicitly re-injects dense object details to perform residual snapping corrections for precise contact. Nevertheless, such aggressive local enforcement can disrupt global dynamics, leading to robotic stiffness. In response, we introduce the Kinematic Harmony Adapter (KHA), which proactively aligns whole-body posture with spatial objectives, ensuring the skeleton actively accommodates constraints without compromising naturalness. Extensive experiments validate that MaMi-HOI simultaneously achieves natural motion and precise contact. Crucially, it extends generation capabilities to long-term tasks with complex trajectories, effectively bridging the gap between global navigation and high-fidelity manipulation in 3D scenes. Code is available at https://github.com/DON738110198/MaMi-HOI.git
☆ Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation IEEE
Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which are grounded on the assumption that the distribution of high-dimensional temporal features well aligns with a Union-of-Subspaces (UoS). For videos in the real world, however, the raw frame-level features often violate the UoS assumption and yield unsatisfactory segmentation performance. To address this issue, we propose an efficient and effective approach for HMS, named Temporal Deep Self-expressive subspace Clustering (TDSC), which jointly learns temporally consistent structured representations and stabilized affinity for accurate and robust HMS. Specifically, in TDSC, we alternately learn structured representations of the input frame features and self-expressive coefficients via a properly regularized self-expressive model, in which a coding-rate maximization regularizer is incorporated to avoid representation collapse and conform the learned representations to span a desired UoS distribution, and meanwhile, temporal constraints are incorporated to promote temporally adjacent frames to be partitioned into the same groups. Moreover, we develop a temporal momentum averaging mechanism to stabilize affinity evolution and design a reparameterization strategy to enable efficient optimization. We conduct extensive experiments on five benchmark HMS datasets using both conventional (HoG) and up-to-date deep features (i.e., CLIP, DINOv2) to validate the effectiveness of our approach.
comment: This manuscript is currently under review by the IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
☆ Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
☆ $\mathcal{B}^{3}$-Net: Controlled Posterior Bridge Learning for Multi-Task Dense Prediction
Multi-task dense prediction solves complementary pixel-level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder-side interactions use attention, prompts, routing, diffusion, Mamba, or bridge features to exchange task evidence, but most of them organize this evidence implicitly. They usually fuse task features by similarity or affinity, without explicitly modeling that evidence reliability varies across tasks and spatial locations. As a result, unreliable evidence may contaminate the shared representation and intensify negative transfer. We propose $\mathcal{B}^{3}$-Net, a controlled posterior bridge learning framework for multi-task dense prediction. Our method decomposes decoder-side interaction into reliability estimation, posterior bridge construction, and bounded redistribution. The Precision Field Estimator estimates patch-wise evidence precision from task-reference alignment and local variation. The Posterior Bridge Operator builds a precision-weighted posterior bridge through heteroscedastic evidence fusion, yielding a shared state more reliable than uniform or heuristic mixtures. The Contractive Dispatch Operator redistributes the bridge to each task branch through a bounded update, reducing uncontrolled feature injection. Experiments on NYUD-v2, PASCAL-Context, and Cityscapes show that $\mathcal{B}^{3}$-Net achieves competitive or superior trade-offs over representative CNN-, Transformer-, diffusion-, Mamba-, and bridge-feature-based methods. Backbone-matched comparisons and extensive analyses further verify that the gains arise from controlled posterior bridge learning rather than backbone capacity or decoder scale.
comment: 14 pages, 10 figures
☆ TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied manipulation. Our approach consists of three components: 1) We construct explicit object-hand-task triadic representations from multimodal inputs as relational primitives. 2) We build a task-grounded relational graph. Task-guided cross-attention forms nodes, and a relation-aware graph transformer models interactions among them. 3) We perform relation-conditioned action generation. The relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and enables transfer across scenes, objects, and task compositions. We further introduce a real-world robotic dataset for fine-tuning. Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization.
☆ EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation NeurIPS 2026
Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks -- EMG-to-pose, vision-to-pose, and EMG+vision fusion -- under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.
comment: 34 pages, 13 figures, 15 tables. Submitted to NeurIPS 2026
☆ Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.
☆ Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition SC
Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.
comment: The source code will be available at https://github.com/MSA-LMC/SCPT
☆ CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers IEEE
Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To address these issues, in this paper, we propose the first compression-friendly encryption method for PPAR, called CFE-PPAR. In CFE-PPAR, videos encrypted with secret keys can be directly recognized by a video transformer, which uses parameters transformed by the same keys as those used for video encryption. In experiments, it is verified that CFE-PPAR outperforms previous methods on the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression.
comment: 6 pages, 5 figures, accepted to 2026 IEEE International Conference on Image Processing (ICIP)
☆ R2H-Diff: Guided Spectral Diffusion Model for RGB-to-Hyperspectral Reconstruction
RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits their ability to model reconstruction uncertainty and often leads to over-smoothed spectral responses. Although diffusion models provide strong distribution modeling capability, their direct application to hyperspectral reconstruction remains challenging due to the high spectral dimensionality, strong inter-band correlations, and strict requirement for spectral fidelity. To this end, we propose R2H-Diff, an efficient diffusion-based framework tailored for RGB-to-HSI reconstruction. Specifically, R2H-Diff formulates spectral recovery as a conditional iterative refinement process, enabling progressive reconstruction under RGB guidance. We proposed a Guided Spectral Refinement Module for RGB-conditioned feature fusion and a Hyperspectral-Adaptive Transposed Attention module for efficient spatial--spectral dependency modeling. Furthermore, a normalization-free denoising backbone is adopted to preserve spectral amplitude consistency, while a task-adapted linear noise schedule enables high-quality reconstruction with only five denoising steps. Extensive experiments on NTIRE2022, CAVE, and Harvard demonstrate that R2H-Diff achieves a favorable balance between reconstruction quality and computational efficiency. Notably, on NTIRE2022, R2H-Diff obtains 35.37 dB PSNR with a sub-million-parameter model of 0.58M parameters and 12.25G FLOPs, achieving the lowest model complexity among the evaluated methods while maintaining strong reconstruction fidelity.
☆ MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery ICML 2026
This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity
comment: Accepted by ICML 2026
☆ EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation
Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence $96.5\%$ of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
☆ Large Vision-Language Models Get Lost in Attention ICML 2026
Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively ``get lost in attention'' rather than efficiently leveraging visual context.
comment: 25 pages, 10 figures. Accepted by ICML 2026
☆ Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes
We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is developed by finetuning a pretrained architecture on the input views and their corresponding degraded counterparts. The adaptation to the scene distribution allows the model to repair Gaussian renderings while effectively eliminating domain gaps. Meanwhile, the trajectory planning scheme optimizes scene coverage by connecting each newly sampled camera to its two nearest neighbors. By iteratively constructing paths and retaining only those that significantly enhance visibility, the scheme establishes a trajectory that covers the entire scene. To address multi-view conflicts, the view-consistency conditioned sampling process quantifies the consistency between neighboring repaired images. This information is injected as a condition into the sampling process of the frozen diffusion model, facilitating the generation of view-consistent images without additional training. Consequently, our approach produces high-fidelity 3D Gaussians that are robust to artifacts. Experimental results demonstrate that S2C-3D outperforms state-of-the-art methods, constructing high-quality scenes that are free from missing regions, blurring, or other artifacts with very sparse inputs. The source code and data are available at https://gapszju.github.io/S2C-3D.
☆ MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality ICML 2026
Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2\% vs. 82.5\%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
comment: 21 pages,Accepted by ICML 2026 main track
☆ AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries
Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-cliped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study \textbf{Vague-Query-driven video Affective Understanding (VQAU)}, a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct \textbf{VQAU-Bench}, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose \textbf{AffectSeek}, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.
☆ Learning a Delighting Prior for Facial Appearance Capture in the Wild SIGGRAPH
High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior arts by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.
comment: ACM Transactions on Graphics (Proc. of SIGGRAPH), 2026. Code: https://github.com/yxuhan/OpenDelight Project Page: https://yxuhan.github.io/OpenDelight/index.html
☆ Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping
Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab-ulaval.github.io/gen4regen.
comment: 36 pages, 17 figures
☆ RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis
Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.
comment: 50 pages, 24 figures, 25 tables
☆ The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
☆ Uncertainty-Guided Edge Learning for Deep Image Regression in Remote Sensing CVPR 2026
Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertainty of the current model on the unlabelled data, which is vital for informing model updating. In this paper, we investigate edge learning in the context of performing deep image regression on a remote sensing satellite, where a deep network is executed by an onboard computer to regress a scalar $y$ from an input image, e.g., $y$ is the percentage of pixels indicating cloud coverage or land use. We propose an uncertainty-guided edge learning (UGEL) algorithm that can accurately prioritise the data to speed up training convergence of the on-board regression model. Underpinning UGEL is the calculation of predictive uncertainty based on deep beta regression, where a deep network is used to estimate the parameters of a beta distribution for which the target $y$ for an input image has a high likelihood. Compared to established methods for uncertainty estimation that are either too costly on edge devices (e.g., require many forward passes per sample) or make strict assumptions on the predictive distribution (e.g., Gaussian), deep beta regression is computable in a single forward pass and allows more general predictive distributions. Results show that UGEL delivers faster-converging edge learning than active or semi-supervised learning. Code and models are publicly available at https://github.com/anh-vunguyen/UGEL.
comment: AI4Space @ CVPR 2026
☆ Text-to-CAD Retrieval: a Strong Baseline
Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
☆ An extremely coarse feedback signal is sufficient for learning human-aligned visual representations
Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity on brain alignment remains largely unexamined. Here we systematically investigate how the coarseness of a learning signal shapes representational alignment with human vision. We parametrically vary the level of signal granularity using a data-driven approach that partitions a set of training images into varied numbers of categories (2, 4, 8, 16, ..., 64) via PCA-based splits of pretrained embeddings. We train hundreds of neural networks across convolutional and transformer architectures on these coarse classification tasks and compare their representations to macaque electrophysiology recordings and human fMRI responses. We find that networks trained to distinguish as few as 8 broad categories learn representations that match or exceed the neural alignment of models distinguishing 1,000-classes. Even more strikingly, these coarsely trained networks align more closely with human perceptual similarity judgments than all other models evaluated, including networks trained with fine-grained supervision or self-supervision as well as leading large-scale vision models. These results demonstrate that human-like visual representations emerge from remarkably coarse feedback, reframing what learning signals vision may require and opening a path toward building AI systems that are more aligned with human perception.
comment: 21 Pages, 6 Figures
☆ A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series
Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature differences among tree species, strong spatial-spectral-temporal information coupling, and the difficulty of modeling large-scale topological context information. To better address these challenges, this paper presents a novel Graph-regulated Disentangled Sparse Mamba model (GDS-Mamba) for enhanced tree species classification, with the following contributions. (1) First, to improve large-scale context modeling, we design a mini-batch graph-regulated approach that explicitly explores topological correlation effects among input images. (2) Second, to disentangle the high-dimensional spatial-spectral-temporal information coupling for improved feature extraction, we propose a novel disentangling Mamba architecture tailored for capturing independent spatial patterns, spectral signatures, and temporal phenology behaviors in MODIS time series. (3) Third, to improve efficiency and subtle feature learning, we design novel sparse token approaches that adaptively learn the optimum subset of tokens to better address the correlation decay problem that bottlenecks standard Mamba models. Extensive experiments using large-scale annual MOD13Q1 data across two Canadian provinces (i.e., Alberta and Saskatchewan) achieved an overall accuracy of 93.94\% in Alberta and 80.19\% in cross-provincial evaluations, outperforming twelve state-of-the-art classification models.
☆ Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings ICLR 2026
The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
comment: Workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)
☆ Dr-BA: Separable Optimization for Direct Radar Bundle Adjustment & Localization
This paper introduces Dr-BA, a first-of-its-kind radar bundle adjustment (BA) framework that operates directly on 2D spinning radar intensity images. Unlike camera or lidar sensors, radar is largely unaffected by precipitation, making it a critical modality for autonomous systems that require all-weather robustness. Existing state estimation approaches using spinning radar typically extract sparse point clouds from range-azimuth-intensity measurements and apply point cloud alignment techniques to estimate vehicle motion, scene structure, or to localize within an existing map. In contrast, Dr-BA uses the full radar returns from multiple scans to jointly estimate dense maps and sensor poses. By formulating the problem as a separable optimization, we derive an efficient and general solution that decouples pose estimation from mapping. In addition to solving the BA problem, this formulation naturally extends to direct radar-only localization (DRL) within a previously built map. Dr-BA achieves state-of-the-art radar-based BA and cross-session localization performance, demonstrated on more than 200 km of on-road data across five distinct routes. Our implementation is publicly available at https://github.com/utiasASRL/dr_ba.
comment: Accepted for presentation at RSS 2026
☆ OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects
In many practical 6D object pose estimation scenarios, we often have access to only a single real-world RGB-D reference view per object, typically without CAD models. Existing methods largely rely on explicit 3D models or multi-view data, which limits their scalability. To address this challenging single-reference model-free setting, we propose \textbf{OneViewAll}, a semantic-prior-guided framework that performs pose estimation via a novel Project-and-Compare paradigm. Instead of relying on computationally expensive CAD-based rendering, our method directly aligns reference and query observations within a projection-equivariant space. OneViewAll progressively integrates hierarchical semantic priors across three levels: (1) \textit{category- and scene-level} priors for efficient hypothesis initialization; (2) \textit{object-level symmetry} priors for geometry completion via mirror fusion; and (3) \textit{patch-level} priors for discriminative refinement. Extensive experiments demonstrate that OneViewAll achieves \textbf{92.5\%} ADD-0.1 accuracy on the LINEMOD dataset using only one real reference view -- significantly outperforming the CVPR 2025 baseline One2Any (52.6\%). It also yields consistent improvements on YCB-V, Real275, and Toyota-Light while maintaining low inference latency. Our results underscore the efficacy of symmetry-aware projection in handling symmetric, texture-less, and occluded objects.
☆ LensVLM: Selective Context Expansion for Compressed Visual Representation of Text
Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.
☆ TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.
☆ Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment
Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision, but naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing continuous quality score, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.
☆ XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling
Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy-accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2-20 target-device samples.
☆ A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve--Synthesize--Refine--Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
comment: Project page: http://dxlong2000.github.io/AARD
☆ Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge
The proliferation of generative video technologies has intensified the need for reliable methods to detect and characterize synthetic media. To address this challenge, we organized the \href{https://safe-video-2025.dsri.org}{SAFE: Synthetic Video Detection Challenge}, co-located with the \textit{Authenticity and Provenance in the Age of Generative AI (APAI) Workshop }at ICCV 2025. The competition invited participants to develop and evaluate algorithms capable of distinguishing real from synthetic videos under fully blind evaluation conditions with over 600 submissions from 12 teams over a 90 day span. Hosted on the Hugging Face platform, the challenge comprised two primary tasks: (1) detection of synthetic video content generated by diverse state-of-the-art models, and (2) detection of synthetic content following common post-processing operations such as resizing, re-compression, motion blur and others. The challenge data consisted of 13 modern high quality synthetic video models with generated content matched to real videos from 21 diverse and challenge sources, all adding up to 20 hours of 6,000 video samples. This paper describes the challenge design, dataset construction, evaluation methodology, and outcomes, offering insights into the generalization and robustness of contemporary synthetic video detection methods. Our findings highlight measurable progress in cross-generator generalization but also persistent vulnerabilities to post-processing artifacts. https://safe-video-2025.dsri.org
☆ Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
Diffusion Transformers (DiTs) have achieved state-of-the-art video generation quality, but they incur immense computational cost because standard inference applies the same number of denoising steps uniformly to every token in the sequence. It is well known that human vision ignores vast amounts of redundant motion. Why, then, do our densest models treat every spatiotemporal token with equal priority? In this paper, we introduce Heterogeneous Step Allocation (HSA), a training-free inference algorithm that assigns varying step budgets to different spatiotemporal tokens based on their velocity dynamics. To resolve the resulting sequence-length mismatch without sacrificing global context, HSA introduces a KV-cache synchronization mechanism that allows active tokens to attend to the full sequence while entirely bypassing inactive tokens. Furthermore, we derive a cached Euler update that advances the latent states of skipped tokens in a single operation without additional model evaluations. We evaluate HSA on the Wan-2 and LTX-2 models for both text-to-video (T2V) and image-to-video (I2V) generation. Our results demonstrate that HSA significantly outperforms previous state-of-the-art caching methods and the vanilla Flow Matching baseline, especially at aggressive acceleration regimes (e.g., 50% and 25% runtimes). Crucially, HSA achieves a superior quality-runtime Pareto frontier without the need for expensive offline profiling, robustly preserving structural integrity and generation quality even under tight computational budgets. Project page: https://ernestchu.github.io/hsa
comment: Project page: https://ernestchu.github.io/hsa
☆ Towards Fairness under Label Bias in Image Segmentation: Impact, Measurement and Mitigation
Labeled datasets reflect the biases of their annotation pipelines, which sometimes introduce label bias: group-conditional label errors that cause systematic performance disparities across demographic subgroups. Label bias in image segmentation remains underexplored, as even detecting it typically requires clean, unbiased annotations, which are not readily available. We present a data-centric adaptation of Confident Learning to segmentation, allowing detection of label bias directly in the training data without a clean, unbiased ground truth. By comparing the provided training labels to the model's confident predictions, we isolate directional errors that quantify the presence and nature of bias, where standard overlap metrics like Dice fail. We further show that label bias influences subgroup separability in the encoder's feature space, an artifact we leverage for bias mitigation rather than suppressing it. We evaluate three datasets, spanning from synthetic to real-life bias, showing how our framework reliably detects and mitigates bias without access to clean labels, achieving equitable performance across experimental conditions.
☆ TriDE: Triangle-Consistent Translation Directions for Global Camera Pose Estimation
Pairwise translation directions are a key input to camera location estimation in global structure-from-motion. Existing estimators usually process each image pair independently, producing directions that may be locally plausible but inconsistent with the other relative directions in the viewing graph. To jointly estimate the direction, we propose TriDE, which exploits camera-triangle consistency as an efficient higher-order verification signal. Instead of solving a costly global nonlinear optimization problem that is sensitive to initialization, TriDE refines unreliable pairwise directions through message passing between directions and their incident weighted triangles. This information propagation strategy enables us to establish a strong phase-transition bound for exact recovery under a realistic random corruption model. Experiments on real image graphs show that TriDE improves direction accuracy by a large margin and yields better downstream camera locations, providing a practical link between local pairwise estimation and global camera pose geometry.
comment: 32 pages, 6 figures
☆ AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting
Adaptive density control in 3D Gaussian Splatting (3DGS) repeatedly grows the Gaussian population through fixed-cardinality random splitting to discover useful scene structure. However, in vanilla 3DGS, its binary split operator requires many densification rounds to expose fine details, making it a bottleneck for efficient training schedules with fewer iterations. We introduce AdpSplit, an error-driven adaptive split operator that determines the number of split children and initializes the child parameters from L1-pixel-error region statistics, enabling fewer densification iterations, thus reduced training time, while preserving the rendering quality of full-schedule training. Across the MipNeRF360, Deep-Blending, and Tanks&Temples datasets, AdpSplit reduces the training time of multiple accelerated 3DGS pipelines by 9.2%-22.3% as a simple drop-in replacement for the standard split operator. With FastGS, AdpSplit matches the full-schedule PSNR on MipNeRF360 while reducing training time by 16.4%, corresponding to a 12.6x acceleration over vanilla 3DGS.
☆ EULER-ADAS: Energy-Efficient & SIMD-Unified Logarithmic-Posit Engine for Precision-Reconfigurable Approximate ADAS Acceleration
Advanced driver-assistance systems (ADAS) require neural compute engines that deliver low-latency inference under strict power and area constraints. Posit arithmetic is attractive for such accelerators because it provides high numerical fidelity at low precision, but its variable-length regime encoding increases encode/decode cost and exposes the datapath to large regime-field fault effects. This paper presents EULER-ADAS, a SIMD-enabled logarithmic bounded-Posit neural compute engine for energyefficient and reliability-aware ADAS acceleration. The proposed datapath combines bounded-regime Posit representation, stageadaptive logarithmic mantissa multiplication with bit truncation, and a SIMD-shared quire accumulation path supporting Posit- (8,0), Posit-(16,1), and Posit-(32,2) execution. The unified architecture enables 4xPosit-8, 2xPosit-16, or 1xPosit-32 operation without duplicating precision-specific hardware. FPGA implementation shows that the proposed configurations reduce LUT count by up to 41.4%, delay by up to 76.1%, and power by up to 71.9% relative to exact Posit neural compute engines, while achieving up to 10x lower energy-delay product than radix-4 Booth-based Posit multipliers. In 28-nm CMOS, the bounded variants occupy 0.013-0.016 mm2 , consume 19.8-22.1 mW, and operate at up to 1.84 GHz. Application-level evaluation across image-classification, ADAS, and edge-inference workloads shows that the evaluated Posit-16 and Posit-32 configurations remain within about 1.5 percentage points of FP32 accuracy. A TinyYOLOv3 prototype on Pynq-Z2 achieves 78 ms latency at 0.29 W and 22.6 mJ/frame, demonstrating the suitability of EULERADAS for low-power real-time ADAS inference.
☆ Knowledge Transfer Scaling Laws for 3D Medical Imaging
Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.
comment: 20 Pages
☆ A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models
We survey continuous-time generative modeling methods based on transporting a simple reference distribution to a data distribution via stochastic or deterministic dynamics. We present a unified framework in which diffusion models, score-based generative models, and flow matching are instances of learning a time-dependent vector field that induces a family of marginals $(ρ_t)_{t \in [0,1]}$ governed by continuity and Fokker-Planck equations. Such a unified theory is timely because these methods are converging methodologically, yet fragmented notation and competing derivations continue to obscure their shared structure and the practical tradeoffs governing sampling, stability, and computation. Within this framework, we (i) derive reverse-time sampling for diffusion and score-based models as controlled stochastic dynamics, (ii) show that the probability flow ODE yields identical marginals and connects diffusion to likelihood-based normalizing flows, and (iii) interpret flow matching as direct regression of the velocity field under a chosen interpolation, clarifying when it coincides with or differs from score-based training. We compare objectives, sampling schemes, and discretization errors under unified notation, discuss connections to Schrodinger bridges and entropic optimal transport, and summarize theoretical guarantees and open problems on approximation, stability, and scalability.
comment: 62 pages, 1 figure, jmlr preprint
☆ Uneven Evolution of Cognition Across Generations of Generative AI Models
The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.
comment: 25 pages, 5 Figures, 3 Tables
☆ Enhancing Eye Movement Biometrics for User Authentication via Continuous Gaze Offset Score Fusion
Eye movement biometrics (EMB) use subject-specific gaze dynamics for user authentication and identification. Recent deep learning-based EMB systems achieve strong performance by modeling temporal eye movement behavior. However, these systems typically overlook continuous gaze offset, despite prior evidence that it contains user-discriminative information. This work examines whether continuous gaze offset can improve biometric performance when combined with existing biometric features. We evaluate linear and nonlinear fusion methods on two publicly available datasets, collected via the lab-grade eye tracker and virtual reality headset across multiple tasks and observation durations. Results indicate that fusion offers performance benefits on both datasets, particularly when using nonlinear fusion. Additionally, fusing biometric information across multiple tasks further improves authentication performance. These findings support the hypothesis that continuous gaze offset may serve as useful auxiliary information under conditions of degraded or noisy eye tracking.
comment: 10 Pages, 1 Figure, 1 Table, Submitted to IJCB 2026
☆ LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, for which we normalize its frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks x 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, LookWhen is more efficient still at 6.7x faster than InternVideo2-B at equal accuracy.
☆ Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.
comment: 28 pages, 19 figures
☆ R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations ICML 2026
Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R$^3$L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. Our key motivation is that multi-hop reasoning requires repeated reference-frame transformations, which accumulate errors in inferred relations and lead to semantic and metric drift. To mitigate this, we propose invariant spatial decomposition to break coupled relation chains, and consistent spatial imagination to promote self-consistency through an imagine-and-revise loop. We further introduce supportive spatial optimization to ease pose optimization via global-to-local coordinate re-parameterization. Extensive experiments across diverse scene types and instructions demonstrate that R$^3$L produces more physically feasible and semantically consistent layouts. Notably, our analysis shows that resolving frame-induced inconsistencies is crucial for reliable multi-hop relative spatial reasoning. The code is available at https://github.com/Neal2020GitHub/R3L.
comment: ICML 2026
♻ ☆ REMAP: Regularized Matching and Partial Alignment of Video Embeddings
Real-world instructional videos are long, noisy, and often contain extended background segments, repeated actions, and execution variability that do not correspond to meaningful procedural steps. We propose **REMAP**, an unsupervised framework for procedure learning based on *Regularized Fused Partial Gromov-Wasserstein Optimal Transport*. REMAP relaxes balanced transport constraints, allowing non-informative or redundant frames to remain unmatched through partial transport. The formulation jointly models semantic similarity and temporal structure, while incorporating Laplacian-based smoothness and structural regularization to prevent degenerate alignments and reduce background interference. We evaluate REMAP on large-scale egocentric and third-person benchmarks. The method consistently outperforms state-of-the-art approaches, achieving up to **11.6\% (+4.45pp)** F1 and **19.6\% (+4.73pp)** IoU improvements on EgoProceL, and an average **41\% (+17.15pp)** F1 gain on ProceL and CrossTask. These results highlight the importance of partial alignment in handling real-world procedural variability and demonstrate that REMAP provides a robust and scalable approach for instructional video understanding.
comment: 9 pages, 4 figures, 6 tables
♻ ☆ Multimodal Fact-Level Attribution for Verifiable Reasoning ICML 2026
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
comment: Accepted to ICML 2026. Code and data are available at https://github.com/meetdavidwan/murgat
♻ ☆ DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
comment: Project website: www.numansaeed.com/mobilefetalclip
♻ ☆ The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
♻ ☆ Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objectively accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76\% $\pm$ 0.04), sensitivity (90.07\% $\pm$ 0.08), and specificity (72.86\% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning.
comment: Accepted to ISBI 2026 conference (6 pages, 5 figures, 1 table)
♻ ☆ DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which replaces fixed patchification with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence through a chunking mechanism learned end-to-end with diffusion training. DC-DiT allocates fewer tokens to predictable regions and noisy timesteps, and more tokens to detailed regions and later refinement stages, yielding meaningful spatial segmentations and timestep-adaptive compression schedules without supervision. Furthermore, the router provides an importance ordering over retained tokens, enabling elastic inference: a single checkpoint can be evaluated at flexible compute budgets with a smooth quality-compute tradeoff. Additionally, DC-DiT can be upcycled from pretrained DiT checkpoints and is also compatible with orthogonal dynamic computation approaches. On class-conditional ImageNet generation, DC-DiT reduces inference FLOPs by up to 36.8% and improves FID by up to 37.8% over DiT baselines, yielding a stronger quality--compute Pareto frontier across model scales, resolutions, and guidance settings. More broadly, these results suggest that adaptive tokenization is a general mechanism for making visual generation both more efficient and more flexible at inference time.
♻ ☆ Layer-Guided UAV Tracking: Enhancing Efficiency and Occlusion Robustness
Visual object tracking (VOT) plays a pivotal role in unmanned aerial vehicle (UAV) applications. Addressing the trade-off between accuracy and efficiency, especially under challenging conditions like unpredictable occlusion, remains a significant challenge. This paper introduces LGTrack, a unified UAV tracking framework that integrates dynamic layer selection, efficient feature enhancement, and robust representation learning for occlusions. By employing a novel lightweight Global-Grouped Coordinate Attention (GGCA) module, LGTrack captures long-range dependencies and global contexts, enhancing feature discriminability with minimal computational overhead. Additionally, a lightweight Similarity-Guided Layer Adaptation (SGLA) module replaces knowledge distillation, achieving an optimal balance between tracking precision and inference efficiency. Experiments on three datasets demonstrate LGTrack's state-of-the-art real-time speed (258.7 FPS on UAVDT) while maintaining competitive tracking accuracy (82.8\% precision). Code is available at https://github.com/XiaoMoc/LGTrack
♻ ☆ Pulling Back the Curtain on Deep Networks
In linear models, visualizing a weight vector naturally reveals the model's preferred input direction, but extending this intuition to deep networks via gradients or gradient ascent often yields brittle or adversarial-looking features. We argue that deep networks are better understood as input-conditioned affine operators, whose natural adjoint action pulls a neuron's preferred direction back to input space. We further refine this representation by backward-only softening and iterative enhancement to reconstruct coherent local structures encoded by the target neuron. This provides a unifying perspective on previously disparate ideas such as SmoothGrad, B-cos-style alignment, and Feature Accentuation. The resulting Semantic Pullbacks (SP) generate perceptually aligned, class-conditional post-hoc explanations that emphasize semantically meaningful features, facilitate coherent counterfactual perturbations, and remain theoretically grounded. Across convolutional architectures (ResNet50, VGG) and transformer-based models (PVT), Semantic Pullbacks achieve the best overall trade-off across faithfulness, stability, and target-sensitivity benchmarks, while remaining general, computationally efficient, and readily integrable into existing deep learning pipelines.
comment: Preprint; 9 pages, 23-page appendix, 12 figures, 6 Tables; v6 changes: slight reframing of the presentation
♻ ☆ VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection
Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains under-explored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and foundation model embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.
comment: Version 2
♻ ☆ Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates
Short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud masking, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework for field-level NDVI prediction under sparse, irregular clear-sky acquisitions. The architecture separates the encoding of historical NDVI and meteorological observations from future exogenous covariates, fusing both representations for multi-step quantile prediction. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to capture delayed meteorological effects relevant to vegetation response. Experiments on European satellite data show that the proposed approach outperforms statistical, deep learning, and time-series baselines on both pointwise and probabilistic evaluation metrics. Ablation studies confirm that target history is the primary driver of performance, with meteorological covariates providing additional gains in the full multimodal setting. The code is available at https://github.com/arco-group/ndvi-forecasting.
♻ ☆ Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression CVPR
Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357 image dual view dataset with laboratory validated, component wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two layer gated depthwise convolution (R^2 = 0.903) outperforms cross view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.
comment: Accepted to CVPR: Vision for Agriculture Workshop 2026 (Withdrawn)
♻ ☆ It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles, namely mutual alignment, unlocking and racing, that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
♻ ☆ A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series
Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.
comment: This version corrects an error in the evaluation pipeline affecting previously reported metrics. Results have been recomputed, leading to updated values and a revised conclusion: the adapted Swin UNETR model does not outperform CNN baselines. Tables, figures, and comparisons have been updated, and the analysis has been extended to include additional transformer-based models
♻ ☆ Aligned explanations in neural networks
As artificial intelligence increasingly drives critical decisions, the ability to genuinely explain how neural networks make predictions is essential for trust. Yet, most current explanation methods offer post-hoc rationalizations rather than guaranteeing a true reflection of the model's reasoning. We introduce the notion of explanatory alignment, a requirement that explanations directly construct predictions rather than rationalize them. To achieve this in complex data domains, we present Pointwise-interpretable Networks (PiNets), a pseudo-linear architecture that forms linear models instance-wise. Evaluated on image classification and segmentation tasks, PiNets demonstrate that their explanations are deeply faithful across four criteria: meaningfulness, alignment, robustness, and sufficiency (MARS). Our contributions pave the way for promising avenues: by reconciling the predictive power of deep learning with the interpretability of linear models, PiNets provide a principled foundation for trustworthy AI and data-driven scientific discovery.
♻ ☆ sFRC for assessing hallucinations in medical image restoration
Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC's effectiveness in detecting hallucinated features for the CT problem and sFRC's agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. Beyond DL-based methods, sFRC's effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
comment: 16 pages; 14 figures; 1 Supplemental document. TechRxiv Preprints, 2025
♻ ☆ LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
Robotic foundation models require reasoning over complex visual scenes to execute adaptive actions in dynamic environments. While recent studies on latent-reasoning Vision-Language-Action (VLA) models have demonstrated the capability to capture fine-grained physical dynamics, they remain predominantly confined to static imitation learning, severely limiting their adaptability and generalization. In this paper, we present LaST-R1, a novel reinforcement learning (RL) post-training framework designed to effectively harness "latent reasoning-before-acting" policies. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a core RL algorithm that jointly optimizes the latent reasoning process and the action generation. By explicitly embedding latent Chain-of-Thought (CoT) reasoning directly within the RL optimization loop, LAPO stimulates profound physical world modeling, which in turn drives robust execution in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced, allowing the policy to dynamically modulate its reasoning horizon based on diverse environment states. Experiments show that LaST-R1 achieves a near-perfect 99.9% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art (SOTA) methods. In real-world deployments, LaST-R1 yields up to a 22.5% average improvement over SOTA supervised fine-tuning approach across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
♻ ☆ The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
Text-to-image diffusion models (DMs) are frequently abused to produce harmful or copyrighted content, violating public interests. Concept erasure (unlearning) is a promising paradigm to alleviate this issue. However, there exists a peculiar forgetting illusion phenomenon with unclear cause. Based on empirical analysis, we formally explain this cause: most unlearning partially disrupt the mapping between linguistic symbols and the underlying internal knowledge, leaving the knowledge intact as dormant memories. We further demonstrate that distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting unlearning strength. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a novel attack framework designed to assess the robustness of current unlearning methods. IVO optimizes initial latent variables to realign the noise distribution of unlearned models with that of their vanilla counterparts, which reconstructs the fractured mappings and consequently revives dormant memories. Extensive experiments covering 11 unlearning techniques and 3 concept scenarios show that IVO outperforms state-of-the-art baselines, exposing fundamental flaws in current unlearning mechanisms. Warning: This paper has unsafe images that may offend some readers.
comment: 25 pages, 12 figures, 12 tables
♻ ☆ From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test driven error exposure and feedback based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.
♻ ☆ Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching
Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. We prove that, for any pre-trained flow matching velocity field, the trace of the posterior covariance over the clean data given the current state equals, in closed form, the divergence of the velocity field, up to a known time-dependent prefactor and an additive constant. We call this the \emph{divergence-uncertainty identity} for flow matching. The matrix-level form of the identity is similarly closed-form, depending solely on the velocity Jacobian. Because the identity is exact and post-hoc, it is computable on any pre-trained flow matching model, with no retraining and no architectural modification. For one-step generators such as MeanFlow, the same identity yields the exact end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly 10,000$\times$ less total compute than ensembling or Monte Carlo dropout.
comment: 9 Pages, 5 figures
♻ ☆ MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
♻ ☆ CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
♻ ☆ Chameleon: Benchmarking Detection and Backtracking on Commercial-Grade AI-Generated Videos ICMR 2026
The proliferation of AI-Generated Content (AIGC), especially deepfake videos, poses a severe threat to social trust by enabling fraud, privacy violations and disinformation. Existing AI-generated video detection (AGVD) benchmarks focus on open-source model generated videos, yet commercial closed-source models produce more realistic, temporally coherent videos that are underexplored in detection research. To fill this gap, we present Chameleon, a commercial-grade dataset with 1,700 AI-generated videos from 600 real-world sources across three key domains (News, Speech, Recommendation), featuring high resolution, rich annotations and 3D consistency metrics for dynamic scene spatial coherence, shifting detection from face-centric forgery to holistic scene forensics. This benchmark assesses models on two core tasks: accurate AI video detection in real-world conditions and forensic backtracking of original sources. Experimental results reveal critical limitations of existing methods in detecting and backtracking high-fidelity, spatiotemporally consistent videos from commercial closed-source models, highlighting current methods' flawed forensic reasoning and establishing Chameleon as a vital challenge for AIGC security research. The code and data are available at https://github.com/lxixim/Chameleon.
comment: Accepted by ICMR 2026
♻ ☆ Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width
Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = Ω(κ_d(\mathbf{G})^6 n\log n)$ for NTK convergence, where $κ_d(\mathbf{G})$ is the effective condition number of the rank-$\min(n,d)$ truncation of the input Gram matrix; for natural image datasets this threshold is physically infeasible ($m \gg 10^{24}$ for MNIST and $m \gg 10^{29}$ for CIFAR-10, 12--17 orders of magnitude beyond the largest known architectures). \emph{Influence malleability} is introduced to characterize this non-convergence: linearized attention exhibits 2--9$\times$ higher malleability than ReLU networks under adversarial data perturbation, with the gap depending on dataset condition number and task setting. A dual implication is established: the same data-dependent kernel is shown theoretically to reduce approximation error when targets align with the data geometry, while, empirically, creating vulnerability to adversarial manipulation of the training data. The structural argument extends to trainable QKV attention under standard initialization, with direct consequences for influence methods applied to deployed transformer architectures.
♻ ☆ AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
comment: 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released
♻ ☆ Making Reconstruction FID Predictive of Diffusion Generation FID
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each dataset element, we retrieve its nearest neighbor in latent space, interpolate between their latent representations, decode the interpolated latent, and compute the FID between the decoded samples and the original dataset. We provide an intuitive explanation for why iFID correlates well with gFID, and why reconstruction metrics can be negatively correlated with gFID, by connecting iFID to recent results on diffusion generalization and hallucination. Theoretically, we show that iFID evaluates decoded interpolations aligned with the ridge set around which diffusion samples concentrate, thereby measuring a quantity closely related to diffusion sample quality. Empirically, iFID is the first metric shown to strongly correlate with diffusion gFID across diverse VAEs, achieving Pearson and Spearman correlations of approximately $0.85$. The project page is available at https://tongdaxu.github.io/pages/ifid.html.
♻ ☆ SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets
In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. SD-MVSum takes into account, in addition to the visual modality, the relevance of the user-provided script with the spoken content (i.e., audio transcript) of the video. The dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for script-driven (S-VideoXum) and generic (MrHiSum) video summarization, to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of the proposed SD-MVSum method against other SotA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
comment: Under review
♻ ☆ HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding ACL 2026
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs, remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
comment: Accepted to ACL 2026 Main
♻ ☆ EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
Multimodal large language models (MLLMs) can enrich industrial anomaly detection with semantic descriptions and anomaly reasoning, but they still lag specialist anomaly detectors in binary detection accuracy. Existing approaches address this gap by fine-tuning MLLMs or training bridging modules to align expert outputs with MLLM inputs, limiting flexibility across backbones. We propose EAGLE, a tuning-free framework that integrates expert anomaly detectors with frozen MLLMs. EAGLE consists of Threshold-Guided Prompt Selection (TGPS), which estimates a decision threshold from expert model statistics and selects textual and visual prompts, and Confidence-Aware Attention Sharpening (CAAS), which shifts MLLM attention toward visual evidence when expert confidence is low. Beyond improving accuracy, we analyze MLLM attention and find that correct anomaly predictions are associated with stronger focus on ground-truth defect regions; EAGLE consistently strengthens this alignment. On MVTec-AD and VisA, EAGLE improves five MLLM backbones without parameter updates, reaching up to 94.4\% and 88.1\% in anomaly discrimination accuracy, respectively, and achieving performance competitive with fine-tuning-based methods while largely preserving MLLM semantic reasoning ability.
♻ ☆ Guidance Watermarking for Diffusion Models
This paper introduces a novel watermarking method for diffusion models. It is based on guiding the diffusion process using the gradient computed from any off-the-shelf watermark decoder. The gradient computation encompasses different image augmentations, increasing robustness to attacks against which the decoder was not originally robust, without retraining or fine-tuning. Our method effectively convert any \textit{post-hoc} watermarking scheme into an in-generation embedding along the diffusion process. We show that this approach is complementary to watermarking techniques modifying the variational autoencoder at the end of the diffusion process. We validate the methods on different diffusion models and detectors. The watermarking guidance does not significantly alter the generated image for a given seed and prompt, preserving both the diversity and quality of generation.
♻ ☆ Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation
Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart, increasing VQA-1F F1 from 0.394 to 0.711, VQA-2F F1 from 0.427 to 0.822, and heading-aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth-segmentation-text-conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language-level reasoning regularizes sparse-condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry-aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.
comment: 21 pages, 12 figures, 7 tables
♻ ☆ Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
comment: Project website: https://sharinka0715.github.io/X-WAM/
♻ ☆ Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography
Accurate delineation of kidney tumours in Computed Tomography (CT) is essential for downstream quantitative analysis and precision oncology, but manual segmentation is a specialised task, time-consuming and difficult to scale. Automated 3D segmentation remains challenging because CT scans are large volumetric images, making high-resolution dense convolutional networks computationally expensive and often dependent on downsampling or patch-based inference. We propose a two-stage 3D segmentation methodology based on voxel sparsification and submanifold sparse convolutional networks (SSCNs). Stage 1 uses a low-resolution sparse network to identify a region of interest (ROI); Stage 2 applies a high-resolution sparse network for refined segmentation within the cropped ROI. This enables native high-resolution 3D processing while reducing memory use and inference time. We evaluate the method on the KiTS23 renal cancer CT dataset using 5-fold cross-validation. Our method achieved Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone, competitive with top KiTS23 approaches. In direct comparisons on the same cross-validation folds, the proposed sparse method achieves tumour + cyst and tumour-only Dice scores comparable to, and slightly higher than, a patch-based nnU-Net baseline, while consistently requiring less VRAM and shorter inference time across the tested hardware. Across the tested GPUs, our sparse model is markedly faster than both nnU-Net and the zero-shot zoom-out/zoom-in foundation model SegVol, which localises kidneys well but underperforms on small heterogeneous lesions. Compared to an equivalent dense implementation of the same architecture, the proposed sparse approach achieves up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage across both CPU and the GPU configurations tested.
comment: 15 pages, 6 figures
♻ ☆ Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix
In the context of multi-modality knowledge distillation research, the existing methods was mainly focus on the problem of only learning teacher final output. Thus, there are still deep differences between the teacher network and the student network. It is necessary to force the student network to learn the modality relationship information of the teacher network. To effectively exploit transfering knowledge from teachers to students, a novel modality relation distillation paradigm by modeling the relationship information among different modality are adopted, that is learning the teacher modality-level Gram Matrix.
comment: 15 pages, 2 figures
♻ ☆ CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography CVPR 2026
Autonomous driving must operate across diverse surfaces to enable safe mobility. However, most driving datasets are captured on well-paved flat roads. Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. Our sensor suite includes synchronized global-shutter stereo cameras, front and rear LiDARs, 6-DoF poses from LiDAR-inertial odometry, per-wheel motion traces, and full calibration. Notably, our multi-LiDAR fusion yields ~500K valid depth pixels per frame, about 6.5x more than KITTI Depth Completion and 10x more on average than other public driving datasets. The dataset spans ~110 km and 4.7 hours across Germany and Italy. In addition, CARD provides 2D bounding boxes targeting road-topography irregularities, enabling accurate benchmarking for both geometry and perception tasks. Furthermore, we establish a standardized evaluation protocol for road surface irregularities on CARD and benchmark state-of-the-art depth estimation models to provide strong baselines. The CARD dataset is hosted on https://huggingface.co/CARD-Data.
comment: Accepted at CVPR 2026 (Highlight). Project page: https://card.content.cariad.digital
♻ ☆ Teaching Metric Distance to Discrete Autoregressive Language Models
Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
♻ ☆ Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
Recent advances in autonomous driving (AD) have highlighted the potential of hyperspectral imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing high-dimensional spectral data remains a significant challenge. This paper presents an empirical investigation of a Multi-Scale Attention Mechanism (MSAM) for enhanced spectral feature extraction through three parallel 1D convolutions with varying kernel sizes (1-11) and adaptive feature aggregation. By integrating MSAM into UNet's skip connections, we evaluate performance improvements in semantic segmentation across multiple HSI datasets for urban driving scenarios. Comprehensive ablation studies demonstrate that MSAM consistently outperforms baseline UNet-SC, achieving average improvements of 2.32% in mIoU and 2.88% in mF1, while maintaining competitive GPU performance against established attention mechanisms. Our findings reveal that optimal kernel combinations are dataset-specific, with configurations such as (1;5;11) and (3;7;11) demonstrating particularly strong performance. This empirical investigation advances understanding of HSI processing capabilities for AD applications and establishes a foundation for adaptive multi-scale spectral feature extraction in automotive deployment.
♻ ☆ SoccerMaster: A Vision Foundation Model for Soccer Understanding CVPR 2026
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection and identification) to high-level semantic reasoning (e.g., event classification). Concretely, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline, SoccerFactory, to generate scalable spatial annotations, and integrate multiple existing soccer video datasets as a comprehensive pretraining data resource for multi-task pretraining; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.
comment: Accepted by CVPR 2026 (Oral); Project Page: https://haolinyang-hlyang.github.io/SoccerMaster
♻ ☆ LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-timestamp 3D Human Pose Estimation WACV2025
Several methods have been proposed to estimate 3D human pose from multi-view images, achieving satisfactory performance on public datasets collected under relatively simple conditions. However, there are limited approaches studying extracting 3D human skeletons from multimodal inputs, such as RGB and point cloud data. To address this gap, we introduce LiCamPose, a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses via single frame. We demonstrate the effectiveness of the volumetric architecture in combining these modalities. Furthermore, to circumvent the need for manually labeled 3D human pose annotations, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy to train a 3D human pose estimator without manual annotations. To validate the generalization capability of our method, LiCamPose is evaluated on four datasets, including two public datasets, one synthetic dataset, and one challenging self-collected dataset named BasketBall, covering diverse scenarios. The results demonstrate that LiCamPose exhibits great generalization performance and significant application potential. The code, generator, and datasets will be made available upon acceptance of this paper.
comment: Accepted by WACV2025
♻ ☆ Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
comment: Accepted by The Visual Computer
♻ ☆ Adaptive Greedy Frame Selection for Long Video Understanding
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1~FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
♻ ☆ VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
With AI-generated videos increasingly indistinguishable from reality, current benchmarks primarily focus on broad semantic alignment and basic physical consistency, offering limited discriminative power for evaluating them. To address this, we introduce VideoASMR-Bench, a benchmark based on Autonomous Sensory Meridian Response (ASMR) videos that emphasizes fine-grained audio-visual perception and sensory immersion. This benchmark aims to answer two key questions: (i) Are today's video understanding models (VLMs) sensitive enough to detect AI-generated ASMR videos by recognizing minor visual, physical, or auditory artifacts? (ii) Can today's video generation models (VGMs) produce convincing ASMR videos with immersive experiences? This benchmark comprises a diverse set of 1,500 high-quality real ASMR videos curated from social media, alongside 2,235 synthetic counterparts generated by nine VGMs. Additionally, we open-source an extensible suite of prompts and reference images, enabling the benchmark to scale dynamically with future video models. Moreover, we design an automatic understanding-generation evaluation framework between VGMs and VLMs, where VGMs aim to produce realistic fake videos to fool the VLMs, while the VLMs seek to detect them, forming an adversarial game between the two parties. Our evaluation on VideoASMR-Bench reveals that even state-of-the-art VLMs, such as Gemini-3-Pro, fail to reliably detect AI-generated ASMR videos. Meanwhile, current frontier video generation models can produce ASMR videos that are difficult for VLMs to distinguish from real ones, while humans can still identify them relatively easily.
comment: Code is at https://github.com/video-reality-test/video-reality-test, page is at https://video-reality-test.github.io/
♻ ☆ v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once to key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines.
♻ ☆ Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Diffusion Models (DMs) have achieved remarkable success in image generation, yet recent studies reveal their vulnerability to backdoor attacks, where adversaries manipulate outputs via covert triggers embedded in inputs. Existing defenses, such as backdoor detection and trigger inversion, are largely effective because prior attacks rely on limited input spaces and low-dimensional triggers that are visually conspicuous or easily captured by neural detectors. To broaden the threat landscape, we propose Gungnir, a novel backdoor attack that activates malicious behaviors through style-based triggers embedded in input images. Unlike explicit visual patches or textual cues, stylistic features serve as stealthy, high-level triggers. We introduce Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR) to preserve trigger-consistent diffusion dynamics in image-to-image tasks. The resulting trigger-embedded samples are perceptually indistinguishable from clean images, evading both manual and automated detection. Extensive experiments show that Gungnir bypasses state-of-the-art defenses with an extremely low backdoor detection rate (BDR) and remains effective under fine-tuning-based purification, revealing previously underexplored vulnerabilities in diffusion models.
♻ ☆ SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing
Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.
♻ ☆ PixelGen: Improving Pixel Diffusion with Perceptual Supervision
Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Codes are available at https://github.com/Zehong-Ma/PixelGen.
comment: Project Pages: https://zehong-ma.github.io/PixelGen/
♻ ☆ StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods
Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.
comment: 27 pages, 10 figures, 9 tables
♻ ☆ Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, representing the inaugural parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native multimodal implementation that facilitates high-efficiency parallel processing. Empirical results on benchmark datasets such as V*, CountBench, RefCOCO, and HallusionBench confirm that Visual Para-Thinker successfully extends the benefits of parallel reasoning to the visual domain.
♻ ☆ Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analysis are conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of explanation map that indicates text-specific saliency region of input image, we also propose an application with Grad-ECLIP, which is adopted to boost the fine-grained alignment in the CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.
♻ ☆ Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space.
comment: 43 pages, 15 figures
♻ ☆ Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering ICML 2026
Continuous navigation in complex environments is critical for Unmanned Aerial Vehicle (UAV). However, the existing Vision-Language Navigation (VLN) models follow the dead-reckoning, which iteratively updates its position for the next waypoint prediction, and subsequently construct the complete trajectory. Then, such stepwise manner will inevitably lead to accumulated errors of position over time, resulting in misalignment between internal belief and objective coordinates, which is known as "state drift" and ultimately compromises the full trajectory prediction. Drawing inspiration from classical control theory, we propose to correct for errors by formulating such sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamics and a Likelihood Correction, from historical observation. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on TravelUAV benchmark demonstrate that, with only 10% of the training data fine-tuning, our method clearly outperforms strong baselines and regulates drift accumulation.
comment: ICML 2026 Camera Ready
♻ ☆ Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition
Constructing dataset for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose \textbf{Masked Language Prompting (MLP)}, a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.
♻ ☆ CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization
Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shift. Code is available at https://github.com/wish44165/CPCANet.
comment: 9 pages, 5 tables
♻ ☆ Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but dont have image captioning ability. Multi-modal LLM (MLLM) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised-fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
♻ ☆ On the Nature of Attention Sink that Shapes Decoding Strategy in Omni-LLMs
The goal of this paper is to strengthen the reasoning of Omnimodal Large Language Models (Omni-LLMs) at inference time, without additional training. These models jointly process video, audio, and text, and given the large number of tokens they consume, how attention is routed across them is central to their behaviour. We focus specifically on attention sinks, tokens that absorb a disproportionate share of attention mass regardless of their semantic content, to understand how this routing unfolds. To this end, we conduct a systematic analysis of sink behaviour in Omni-LLMs. Our analysis yields two key findings: (i) high sink attention does not solely indicate head redundancy, suggesting that sink value representations play additional functional roles; (ii) the sink value vector acts as a shared bias added to every token's output, serving as a global signal that organises the representation as a whole. Building on this, we propose OutRo, which correspondingly aligns non-sink token representations with the sink in feature space, and relaxes the causal mask for sink tokens at an early layer to sharpen this bias before the rest of decoding proceeds. This design enhances the reasoning process without requiring additional forward passes or access to attention maps. Based on extensive experiments, OutRo consistently improves performance on seven video QA benchmarks and demonstrates strong generalisation, while incurring only a 1.1x decoding overhead.
comment: Preprint
♻ ☆ MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching
3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0--1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34$\times$ compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20$\times$ compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
comment: https://github.com/mmlab-sigs/mesongs_plus
♻ ☆ Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM achieves competitive accuracy while running $1.30\times$ and $1.61\times$ faster, respectively.
comment: 22 pages, 10 figures
♻ ☆ Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
Computer-assisted surgery research requires large, deeply annotated video datasets that capture clinical and technical variability. Existing cataract surgery resources lack the diversity and annotation depth required to train generalizable deep-learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos acquired at two surgical centers from surgeons with varying expertise. The dataset provides four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument-tissue interaction tracking, and quantitative skill scores based on competency rubrics adapted from ICO-OSCAR and GRASIS. We demonstrate the technical utility of the dataset through benchmarking deep learning models across four tasks: workflow recognition, scene segmentation, instrument-tissue interaction tracking, and automated skill assessment. Furthermore, we establish a domain-adaptation baseline for phase recognition and instance segmentation by training on one surgical center and evaluating on a held-out center. Ultimately, these multi-source acquisitions, multi-layer annotations, and paired skill-kinematic labels facilitate the development of generalizable multi-task models for surgical workflow analysis, scene understanding, and competency-based training research.
comment: 28 pages, 14 figures, 15 tables. Data descriptor for the Cataract-LMM benchmark dataset. Source code and dataset are available
♻ ☆ Token-Level Entropy Reveals Demographic Disparities in Language Models
We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures - opposite to the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024) - and Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($ΔΔ> 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hatβ= -0.041, p = .019$) and more homogeneous outputs ($\hatα= +0.024, p < .001$) than men-associated names - a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hatβ=+0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant - establishing that probing methodology is a primary determinant of which distributional structure is recovered.
comment: 9 pages
♻ ☆ SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
comment: Project page: https://pyh-129.github.io/SOW/
♻ ☆ AgriKD: Cross-Architecture Knowledge Distillation for Efficient Leaf Disease Classification
Automated leaf disease classification is critical for early disease detection in resource-constrained field environments. Vision Transformers (ViTs) provide strong representation capability by modeling long-range dependencies and inter-class relationships; however, their high computational cost makes them impractical for deployment on edge devices. As a result, existing approaches struggle to effectively transfer these rich representations to lightweight models. This paper introduces AgriKD, a cross-architecture knowledge distillation framework for efficient edge deployment, which transfers knowledge from a Vision Transformer (ViT) teacher to a compact convolutional student model. To bridge the representational gap between Transformer and CNN architectures, the proposed approach integrates multiple distillation objectives at the output, feature, and relational levels, where each objective captures a different aspect of the teacher knowledge. This enables the student model to better preserve and utilize transformer-derived global representations. Experiments on multiple leaf disease datasets show that the distilled student achieves performance comparable to the teacher while significantly improving efficiency, reducing model parameters by approximately 172 times, computational cost by 47.57 times, and inference latency by 18-22 times. Furthermore, the optimized model is deployed across multiple runtime formats, including ONNX, TFLite Float16, and TensorRT FP16, achieving consistent predictive performance with negligible accuracy degradation. Real-world deployment on NVIDIA Jetson edge devices and a mobile application demonstrates reliable real-time inference, highlighting the practicality of AgriKD for AI-powered agricultural applications in resource-constrained environments.
comment: 47 pages, 14 figures
♻ ☆ MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation SIGGRAPH 2026
With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.
comment: Accepted by SIGGRAPH 2026
♻ ☆ LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven events distribution. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify the gap between existing benchmark and human's cognition. Thus, we introduce ICH-CC built from carefully designed questions by annotators that reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that enhanced captions with LFS leads to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
♻ ☆ FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models ICML 2026
Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that different SVD subspaces contribute differently to reasoning and perception, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities while largely preserving the model's visual capabilities by consistently achieving strong performance across diverse visual-language reasoning benchmarks.
comment: Accepted by ICML 2026
♻ ☆ Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning
Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model's parameters frozen. It enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer's representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.
comment: We have reconsidered the issue, and an updated version will be released later
♻ ☆ Boosting Reasoning in Large Multimodal Models via Activation Replay CVPR 2026
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code is publicly available at https://github.com/latentcraft/replay.
comment: CVPR 2026
♻ ☆ Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM
In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM2 (Segment Anything Model2) for frames, to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed play-and-plug Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI's encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI's encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.
comment: Code will be released
♻ ☆ UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models
Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.
♻ ☆ Structured 3D Latents Are Surprisingly Powerful: Unleashing Generalizable Style with 2D Diffusion
3D asset generation plays a pivotal role in fields such as gaming and virtual reality, enabling the rapid synthesis of high-fidelity 3D objects from a single or multiple images. Building on this capability, enabling style-controllable generation naturally emerges as an important and desirable direction. However, existing approaches typically rely on style images that lie within or are similar to the training distribution of 3D generation models. When presented with out-of-distribution (OOD) styles, their performance degrades significantly or even fails. To address this limitation, we introduce \textbf{DiLAST}: 2D Diffusion-based Latent Awakening for 3D Style Transfer. Specifically, we leverage a pretrained 2D diffusion model as a teacher to provide rich and generalizable style priors. By aligning rendered views with the target style under diffusion-based guidance, our method optimizes the structured 3D latent representations for stylization. We observe that this limitation stems not from insufficient model capacity, but from the underutilization of structured 3D latents, which are inherently expressive. Despite being trained on comparatively limited data, 3D generation models can leverage 2D diffusion guidance to steer denoising toward specific directions in latent space, thereby producing diverse, OOD styles. Extensive experiments across diverse data and multiple 3D generation backbones demonstrate the effectiveness and plug-and-play nature of our approach.
♻ ☆ LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer IEEE
Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID)-Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River-DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen' s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT' s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
comment: This work has been submitted to the IEEE for possible publication.The project is available at https://github.com/weicongpang/LVC2-DViT.git
♻ ☆ Shape2Animal: Creative Animal Generation from Natural Silhouettes
Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces Shape2Animal framework to mimics this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging text-to-image diffusion model and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Our Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io
♻ ☆ ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control
A pictorial chart is an effective medium for visual storytelling, seamlessly integrating visual elements with data charts. However, creating such images is challenging because the flexibility of visual elements often conflicts with the rigidity of chart structures. This process thus requires a creative deformation that maintains both data faithfulness and visual aesthetics. Current methods that extract dense structural cues from natural images (e.g., edge or depth maps) are ill-suited as conditioning signals for pictorial chart generation. We present ChArtist, a domain-specific diffusion model for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. To achieve this, we introduce a skeleton-based spatial control representation. This representation encodes only the data-encoding information of the chart, allowing for the easy incorporation of reference visuals without a rigid outline constraint. We implement our method based on the Diffusion Transformer (DiT) and leverage an adaptive position encoding mechanism to manage these two controls. We further introduce Spatially Gated Attention to modulate the interaction between spatial control and subject control. To support the fine-tuning of pre-trained models for this task, we created a large-scale dataset of 30,000 triplets (skeleton, reference image, pictorial chart). We also propose a unified data accuracy metric to evaluate the data faithfulness of the generated charts. We believe this work demonstrates that current generative models can achieve data-driven visual storytelling by moving beyond general-purpose conditions to task-specific representations. Project page: https://chartist-ai.github.io/.
comment: Project page: https://chartist-ai.github.io/
♻ ☆ RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
The proliferation of AI-generated video models poses new challenges to information integrity and digital trust. A key confound, however, remains unaddressed: commercial generators embed visible overlay watermarks for provenance tracking, yet no existing benchmark controls for this variable, leaving open whether detectors learn genuine generation artefacts or merely associate watermark patterns with AI-generated labels. We present RobustSora, a benchmark of 6,500 manually verified videos in four categories: Authentic-Clean (A-C), Generated-Watermarked (G-W), Generated-DeWatermarked (G-DeW), and Authentic-Spoofed (A-S), sourced from Vript, DVF, and UltraVideo (authentic) and from Sora, Sora 2, Pika, Open-Sora 2, and KLing (generated). Two evaluation tasks isolate watermark effects: Task-I (Watermark Erasure Robustness) tests detection on watermark-removed AI videos; Task-II (Watermark Spoofing Robustness) measures false-alarm rates on authentic videos injected with fake watermarks. Across ten models spanning specialized detectors, transformer classifiers, and MLLMs, watermark manipulation induces accuracy changes of $-9.4$ to $+1.6$ pp (mean 6.6 pp; $p{<}0.01$ for 7/10 models on each task). A placebo control bounds inpainting-artefact confounds at $\le$2 pp, and a watermark-aware training augmentation recovers 3-4 pp on both tasks, together providing causal evidence that detectors actively rely on watermark cues. Per-generator breakdown shows that Sora 2 induces drops of $-11$ to $-14$ pp versus $-3$ to $-6$ pp for Pika and Open-Sora 2, indicating that watermark prominence, rather than detector architecture, is the principal driver of dependency. These results argue for watermark-aware evaluation and training in AIGC video detection. Dataset, evaluation code, and pretrained checkpoints will be released.
♻ ☆ Motion-o: Trajectory-Grounded Video Reasoning
Recent video reasoning models increasingly produce spatio-temporal evidence chains that localize objects at specific timestamps. While these traces improve interpretability by grounding \emph{where} and \emph{when} evidence appears, they often leave the motion connecting observations, the \textit{how}, implicit. This makes dynamic and trajectory-dependent claims difficult to supervise, verify, or penalize when unsupported by the video. We formalize this missing component as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric extension to vision-language models (VLMs) that makes trajectories explicit and verifiable. Motion-o augments evidence chains with Motion Chain of Thought (MCoT), a structured pathway that represents object motion through a discrete \texttt{} tag summarizing direction, speed, and scale change. To supervise MCoT, we densify sparse spatio-temporal annotations into object tracks and derive motion descriptors from centroid displacement and box-area change. We then train with complementary rewards for trajectory consistency and visual grounding, including a perturbation-based signal that penalizes motion descriptions that remain unchanged when temporal evidence is removed. Across multiple video understanding benchmarks, Motion-o consistently improves trajectory-faithful reasoning without architectural modifications. These results suggest that an explicit motion interface can complement existing VLM pipelines by converting implicit dynamics into verifiable evidence. Code is available at~\href{https://github.com/ostadabbas/Motion-o}{\faGithub\ \texttt{ostadabbas/Motion-o}}.
♻ ☆ RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection
Recent advancements in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still generally struggle with a spatial reasoning dilemma, lacking the ability to accurately capture fine-grained spatial relationships from the prompt and correctly generate scenes with structural integrity. To mitigate this dilemma, we propose RL-RIG, a Reinforcement Learning framework for Reflection-based Image Generation. Our architecture comprises four primary components: Diffuser, Checker, Actor, and Inverse Diffuser, following a Generate-Reflect-Edit paradigm to spark the Chain of Thought reasoning ability in image generation for addressing the dilemma. To equip the model with better intuition over generation trajectories, we further develop Reflection-GRPO to train the VLM Actor for edit prompts and the Image Editor for better image quality under a given prompt, respectively. Unlike traditional approaches that solely produce visually stunning yet structurally unreasonable content, our evaluation metrics prioritize spatial accuracy, utilizing Scene Graph IoU and employing a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11% in terms of controllable and precise spatial reasoning in image generation.
♻ ☆ Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models
State-of-the-art text-to-image models produce high-quality images, but inference remains expensive as generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons to multi-step systems are difficult because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, where CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address this, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment and human preference signals, worsening visual quality. To make these tradeoffs explicit, we introduce CLIP-scaled and PickScore-scaled variants of FID (csFID, psFID) and Inception Score (csIS, psIS) to serve as a diagnostic for semantically aligned image generation.
♻ ☆ DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection ICLR 2026
Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
comment: Accepted at ICLR 2026
♻ ☆ SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces cross-token quantization inconsistency by enforcing format uniformity among semantically correlated tokens. Experiments demonstrate that SemanticDialect outperforms prior quantization methods and block-wise formats (MXFP4, NVFP4) while approaching FP16 quality on Open-Sora 2.0. We also validate hardware deployability through RTL design and GPU kernel implementation.
♻ ☆ Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
Vision-language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain ad-hoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation. Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.
♻ ☆ A Causal Diffusion Model for Video Reconstruction from Ultra-Low-Bitrate Representations
We study video reconstruction from ultra-low-bitrate representations, where the primary challenge shifts from encoding to decoding. In this regime, reconstruction with classical and neural codecs introduces blur, while generative and semantic approaches often struggle to jointly preserve fidelity, temporal consistency, and perceptual quality. To address these limitations, we propose a causal video diffusion model that reconstructs videos from ultra-low-bitrate semantics and highly compressed frames by jointly modeling their complementary information. We further introduce temporal-only distillation from a bidirectional teacher to enable parameter-efficient training and causal few-step inference. Through extensive quantitative, qualitative, and subjective evaluation, we show that the proposed method outperforms classical, neural, generative, and semantic baselines in ultra-low-bitrate video reconstruction.
♻ ☆ Lossy Common Information in a Learnable Gray-Wyner Network
Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.
♻ ☆ TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses
State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.
♻ ☆ Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse "pulses": sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).
comment: This article is withdrawn because the experimental results and analysis require substantial revision. The current version should not be cited as a reliable representation of the work
♻ ☆ Factored Classifier-Free Guidance ICML 2026
Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifier-free guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we propose Factored Classifier-Free Guidance (FCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. FCFG complements recent advances in classifier-free guidance and can be seamlessly extended to advanced guidance schemes such as CFG++ and APG. Our experiments demonstrate that FCFG significantly improves the axiomatic soundness of inferred counterfactuals across both natural and medical image datasets, mitigating spurious amplification effects, and enhancing counterfactual reversibility.
comment: Accepted at ICML 2026
♻ ☆ 3D tomography of exchange phase in a Si/SiGe quantum dot device
The exchange interaction is a foundational building block for the operation of spin-based quantum processors. Extracting the exchange interaction coefficient $J(\mathbf{V})$, as a function of gate electrode voltages, is important for understanding disorder, faithfully simulating device performance, and operating spin qubits with high fidelity. Typical coherent measurements of exchange in spin qubit devices yield a modulated cosine of an accumulated phase, which in turn is the time integral of exchange. As such, extracting $J(\mathbf{V})$ from experimental data is difficult due to the ambiguity of inverting a cosine, the sensitivity to noise when unwrapping phase, as well as the problem of inverting the integral. As a step toward obtaining $J(\mathbf{V})$, we tackle the first two challenges to reveal the accumulated phase, $φ(\mathbf{V})$. We incorporate techniques from a wide range of fields to robustly extract and model a 3D phase volume for spin qubit devices from a sequence of 2D measurements. In particular, we present a measurement technique to obtain the wrapped phase, as done in phase-shifting digital holography, and utilize the max-flow/min-cut phase unwrapping method (PUMA) to unwrap the phase in 3D voltage space. We show this method is robust to the minimal observed drift in the device, which we confirm by increasing scan resolution. Upon building a model of the extracted phase, we optimize over the model to locate a minimal-gradient $π$ exchange pulse point in voltage space. Our measurement protocol may provide detailed information useful for understanding the origins of device variability governing device yield, enable calibrating device models to specific devices during operation for more sophisticated error attribution, and enable a systematic optimization of qubit control. We anticipate that the methods presented here may be applicable to other qubit platforms.
comment: 11 pages, 6 figures; updated acknowledgements
♻ ☆ Contrast-X: A Multi-Modal Contrast Image Synthesis Benchmark and Universal Modality Flow Matching
Contrast-enhanced imaging is central to oncologic diagnosis, but contrast agents can be contraindicated for many of the patients who need them most. Synthesizing contrast scans from non-contrast inputs is the natural response. Two obstacles stand in the way: no benchmark provides paired contrast data with lesion-level evaluation, and no single model handles the arbitrary missing patterns seen in practice. We introduce Contrast-X, a benchmark of paired contrast-enhanced and non-contrast imaging spanning 10 organs in CT (1{,}526 patients) and multi-phase breast DCE-MRI (1116 patients). Every case carries radiologist-verified phase labels and tumor masks. We further propose FlowMI, a single model that handles arbitrary subsets of available modalities through a unified multi-modal latent space and flow matching. We benchmark a range of missing-modality configurations, reporting standard image-quality metrics, radiologist reader studies, and downstream lesion analysis on the synthesized scans. We further evaluate cross-organ generalization to test whether the model has learned a transferable contrast-enhancement operation. Dataset, code, and leaderboard will be released. Our code are available at https://github.com/YifanChen02/Contrast-X.
♻ ☆ Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry ICLR 2026
DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We found that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low dimensional, locally connected set persisting after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gardenfors' conceptual spaces and in the model's mechanism as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
comment: Accepted at ICLR 2026
♻ ☆ Deeply Dual Supervised learning for melanoma recognition
As the application of deep learning in dermatology continues to grow, the recognition of melanoma has garnered significant attention, demonstrating potential for improving diagnostic accuracy. Despite advancements in image classification techniques, existing models still face challenges in identifying subtle visual cues that differentiate melanoma from benign lesions. This paper presents a novel Deeply Dual Supervised Learning framework that integrates local and global feature extraction to enhance melanoma recognition. By employing a dual-pathway structure, the model focuses on both fine-grained local features and broader contextual information, ensuring a comprehensive understanding of the image content. The framework utilizes a dual attention mechanism that dynamically emphasizes critical features, thereby reducing the risk of overlooking subtle characteristics of melanoma. Additionally, we introduce a multi-scale feature aggregation strategy to ensure robust performance across varying image resolutions. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods in melanoma detection, achieving higher accuracy and better resilience against false positives. This work lays the foundation for future research in automated skin cancer recognition and highlights the effectiveness of dual supervised learning in medical image analysis.
comment: arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions
♻ ☆ Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling
Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.
comment: arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions
♻ ☆ DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation
Lesion segmentation, in contrast to natural scene segmentation, requires handling subtle variations in texture and color, frequent imaging artifacts (such as hairs, rulers, and bubbles), and a critical need for precise boundary localization to aid in accurate diagnosis. The accurate delineation of melanocytic tumors in dermoscopic images is a crucial component of automated skin cancer screening systems and clinical decision support. In this paper, we present a novel dual-resolution architecture inspired by ResNet, specifically tailored for the segmentation of melanocytic tumors. Our approach incorporates a high-resolution stream that preserves fine boundary details, alongside a complementary pooled stream that captures multi-scale contextual information for robust lesion recognition. These two streams are closely integrated through boundary-aware residual connections, which inject edge information into deep feature maps, and a channel attention mechanism that adapts the model's sensitivity to color and texture variations in dermoscopic images. To tackle common imaging artifacts and the challenges posed by small clinical datasets, we introduce a lightweight artifact suppression block and a multi-task training strategy. This strategy combines the Dice-Tversky loss with an explicit boundary loss and a contrastive regularizer to enhance feature stability. This unified design enables the model to generate pixel-accurate segmentation masks without the need for extensive post-processing or complex pre-training. Extensive evaluation on public dermoscopic benchmarks reveals that our method significantly enhances boundary precision and clinically relevant segmentation metrics, outperforming traditional encoder-decoder baselines. This makes our approach a valuable component for building automated melanoma assessment systems.
comment: arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions
♻ ☆ VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation
The task of parsing subcutaneous vessels in clinical images is often hindered by the high cost and limited availability of ground truth data, as well as the challenge of low contrast and noisy vessel appearances across different patients and imaging modalities. In this work, we propose a novel weakly supervised training framework specifically designed for subcutaneous vessel segmentation. This method utilizes low-cost, sparse annotations such as centerline traces, dot markers, or short scribbles to guide the learning process. These sparse annotations are expanded into dense probabilistic supervision through a differentiable random walk label propagation model, which integrates vesselness cues and tubular continuity priors driven by image data. The label propagation process results in per-pixel hitting probabilities and uncertainty estimates, which are incorporated into an uncertainty-weighted loss function to prevent overfitting in ambiguous areas. Notably, the label propagation model is trained jointly with a CNN-based segmentation network, allowing the system to learn vessel boundaries and continuity constraints without the need for explicit edge supervision. Additionally, we introduce a topology-aware regularizer that encourages centerline connectivity and penalizes irrelevant branches, further enhancing clinical applicability. Our experiments on clinical subcutaneous imaging datasets demonstrate that our approach consistently outperforms both naive sparse-label training and traditional dense pseudo-labeling methods, yielding more accurate vascular maps and better-calibrated uncertainty, which is crucial for clinical decision-making. This method significantly reduces the annotation workload while maintaining clinically relevant vessel topology.
comment: arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions
♻ ☆ A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset
Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios, as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with a 93.3% identity preservation (IDF1) score and an 89.3% average precision (AP) for object detection. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.
comment: 9 figures
Artificial Intelligence 301
☆ ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation SIGGRAPH 2026
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
comment: SIGGRAPH 2026
☆ UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
☆ BAMI: Training-Free Bias Mitigation in GUI Grounding CVPR 2026
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed \textbf{Masked Prediction Distribution (MPD)} attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce \textbf{Bias-Aware Manipulation Inference (BAMI)}, which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9\% to 57.8\%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
comment: Accepted by CVPR 2026
☆ Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter's reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
☆ Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.
☆ When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
comment: SimpleAudit Repository: https://github.com/kelkalot/simpleaudit
☆ AI Co-Mathematician: Accelerating Mathematicians with Agentic AI
We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows. In early tests, the AI co-mathematician helped researchers solve open problems, identify new research directions, and uncover overlooked literature references. Besides demonstrating a highly interactive paradigm for AI-assisted mathematical discovery, the AI co-mathematician also achieves state of the art results on hard problem-solving benchmarks, including scoring 48% on FrontierMath Tier 4, a new high score among all AI systems evaluated.
comment: 22 pages
☆ Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce \textit{SuperIntelligent Retrieval Agent} (SIRA), which defines \emph{superintelligence} in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and document-frequency statistics as a tool call to filter proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA achieves the significantly superior performance outperforming dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
☆ Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7, 402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
comment: Code: https://github.com/lihongzhao99/MMDG_Benchmark
☆ StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
comment: 26 pages, 4 figures, 7 tables
☆ Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
*Concept-based explanations* offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on *formal abductive and contrastive explanations* computes the minimal set of input features causally relevant for model outcomes but only considers low-level features such as pixels. Merging these two threads, in this work, we propose the notion of *concept-based abductive and contrastive explanations* that capture the minimal sets of high-level concepts causally relevant for model outcomes. We then present a family of algorithms that enumerate all minimal explanations while using *concept erasure* procedures to establish causal relationships. By appropriately aggregating such explanations, we are not only able to understand model predictions on individual images but also on collections of images where the model exhibits a user-specified, common *behavior*. We evaluate our approach on multiple models, datasets, and behaviors, and demonstrate its effectiveness in computing helpful, user-friendly explanations.
☆ GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation
Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.
☆ Recursive Agent Optimization
We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model's context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
☆ Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
☆ MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems ICML 2026
Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
comment: Accepted at ICML 2026
☆ When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.
comment: Code is available at https://github.com/Dingzhen230/SignSGD_Outperforms_SGD
☆ SkillOS: Learning Skill Curation for Self-Evolving Agents
LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.
comment: 11 pages, 6 figures, 3 tables
☆ The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity ICML 2026
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose \textit{head-wise RMSNorm}, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
comment: Accepted to ICML 2026
☆ AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge, is the first to span literature-grounded ideation, validated execution, vision-based physics verification, source-code modification, and figure-grounded writing within a single inspectable workflow. Three coupled pathways cover parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and open-ended hypothesis search against a reference comparator, all running on OpenFOAM through Foam-Agent. At the center of the framework is a vision-language physics-verification gate that inspects rendered flow fields before any result is accepted, rerun, or written into a manuscript. On five tasks under a shared GPT-5.5 backbone, AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction that reduces lower-wall Cf RMSE against DNS by 7.89% on the periodic hill at Reh=5600; under matched LLM cost, two strong general AI-scientist baselines (ARIS, DeepScientist) execute partial CFD workflows but lack the domain-specific validity gates needed to convert runs into defensible scientific claims; and a controlled planted-failure ablation shows that the vision-language gate detects 14 of 16 silent failures missed by solver-level checks. Code, prompts, and run artifacts are released at https://github.com/csml-rpi/cfd-scientist.
comment: 9 main pages and rest in appendix
☆ Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches
Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.
☆ UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
comment: 22 pages, 12 figures
☆ Cross-Modal Navigation with Multi-Agent Reinforcement Learning
Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose \textbf{CRONA}, a Multi-Agent Reinforcement Learning (MARL) framework for \textbf{Cro}ss-Modal \textbf{Na}vigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
☆ DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study -- order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation -- fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
comment: 18 pages, 7 figures, 9 tables. Code will be made publicly available upon acceptance
☆ Towards Metric-Faithful Neural Graph Matching
Graph Edit Distance (GED) is a fundamental, albeit NP-hard, metric for structural graph similarity. Recent neural graph matching architectures approximate GED by first encoding graphs with a Graph Neural Network (GNN) and then applying either a graph-level regression head or a matching-based alignment module. Despite substantial architectural progress, the role of encoder geometry in neural GED estimation remains poorly understood. In this paper, we develop a theoretical framework that connects encoder geometry to GED estimation quality for two broad classes of neural GED estimators: graph similarity predictors and alignment-based methods. On fixed graph collections, where the doubly-stochastic metric $d_{\mathrm{DS}}$ is comparable to GED, we show that graph-level bi-Lipschitz encoders yield controlled GED surrogates and improved ranking stability; for matching-based estimators, node-level bi-Lipschitz geometry propagates to encoder-induced alignment costs and the resulting optimized alignment objective. We instantiate this perspective using FSW-GNN, a bi-Lipschitz WL-equivalent encoder, as a drop-in replacement in representative neural GED architectures. Across representative baselines and benchmark datasets, the resulting geometry-aware variants significantly improve GED prediction and ranking metrics. A faithfulness case study of untrained encoders, together with ablations and transfer experiments, supports the view that these gains arise from improved representation geometry, positioning encoder geometry as a useful design principle for neural graph matching.
☆ NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
Multimodal neuroimaging analysis often involves complex, modality-specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and disease classification commonly require task-specific code, evaluation protocols, and data-format conventions, creating additional barriers between raw acquisitions and reproducible scientific analysis. We present NeuroAgent, an LLM-driven agentic framework that automates key preprocessing and analysis steps for heterogeneous neuroimaging data, including sMRI, fMRI, dMRI, and PET, and supports interactive downstream analysis through natural-language queries. NeuroAgent employs a hierarchical multi-agent architecture with a feedback-driven Generate-Execute-Validate engine: agents autonomously generate executable preprocessing code, detect and recover from runtime errors, and validate output integrity. We evaluate the system on 1,470 subjects pooled across all ADNI phases (CN=1,000, AD=470), where all subjects have sMRI and tabular data, with subsets also having Tau-PET (n=469), fMRI (n=278), and DTI ($n=620$). Pipeline ablation studies across multiple LLM backends show that capable models reach up to 100% intent-parsing accuracy, with the strongest backend (Qwen3.5-27B) reaching 84.8% end-to-end preprocessing step correctness. Automated recovery limits manual intervention to edge cases where human review is required via the Human-In-The-Loop interface. For Alzheimer's Disease classification using automatically preprocessed multimodal data, our agent ensemble achieves an AUC of 0.9518 with four modalities, outperforming all single-modality baselines. These results show that NeuroAgent can reduce the manual effort required for neuroimaging preprocessing and enable end-to-end automated analysis pipelines for neuroimaging research.
☆ Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target under the current policy, leading to a simple and stable training objective. Building on this perspective, we introduce a truncated adjoint scheme that focuses computation on the terminal portion of the trajectory, where reward-relevant signals concentrate, which yields substantial computational savings while preserving alignment quality. We further generalize the framework beyond standard KL-based regularization, allowing more flexible trade-offs between alignment strength and distributional preservation. Experiments on SiT-XL/2 and FLUX.2-Klein-4B demonstrate consistent gains across multiple alignment metrics, along with substantially improved diversity and mode preservation.
☆ Directional Consistency as a Complementary Optimization Signal: The GONO Framework
We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via consecutive gradient cosine similarity) while the loss remains high or decreases slowly. This observation reveals that existing optimizers such as Adam, SGD, and RMSprop lack explicit mechanisms to exploit temporal consistency in gradient directions, relying instead on magnitude-based signals that fail to distinguish plateaus, saddle points, and genuine convergence. Motivated by this, we introduce GONO (Gradient-Oriented Norm-Adaptive Optimizer), which adapts Adam's momentum coefficient beta_1 based on cc_t: amplifying momentum under directional consistency and suppressing it during oscillation. We prove GONO matches Adam's O(1/sqrt(T)) convergence rate and reduces exactly to Adam when the signal is uninformative. Empirically, cc_t achieves oscillation detection with F1=1.00 (vs. 0.45 for gradient norm), and GONO remains competitive with AdamW on MNIST (98.15%), CIFAR-10 (43.14%), and ResNet-18 (75.44%), establishing directional alignment as a theoretically grounded, practically actionable optimization signal. Code: https://github.com/victordaniel/gono-optimizer
☆ Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value-based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task-completion efficiency. We find that in commitment-constrained task allocation, performance under scale is shaped not only by nominal action-space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination-aware evaluation as a necessary complement to return-based benchmarking for cooperative MARL.
comment: 27 pages. Submitted and under review
☆ Continuous Latent Diffusion Language Model
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
comment: 99 pages, 31 figures, 9 tables. Project page: https://hongcanguo.github.io/Cola-DLM/
☆ Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding risk from model-only generations and matched unaided human baselines. By modeling ideas as congestible resources, we show that source-level crowding is identifiable from within-distribution comparisons, yielding an excess-crowding coefficient $Δ$ and a human-relative diversity ratio $ρ$. We show that $ρ\ge1$ is the no-excess-crowding parity condition and connect $Δ$ to an adoption game with exposure-dependent redundancy costs. Across short stories, marketing slogans, and alternative-uses tasks, three frontier LLMs fall below parity across crowding kernels. Estimates stabilize with feasible model-only sample sizes. Importantly, generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI.
☆ Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
comment: Tech Report. Project Page: https://showlab.github.io/Sparkle/
☆ SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting
Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal methods are natural candidates for improving forecasts. Despite growing interest in spatial information, no standardized benchmark exists, and current evaluations often use simple chronological train-test splits that do not reflect real-time forecasting practice. We address this gap with SpatialEpiBench, a challenging benchmark for spatiotemporal epidemic forecasting in realistic public-health settings. SpatialEpiBench includes 11 epidemic datasets with standardized rolling evaluations and outbreak-specific metrics. We evaluate adjacency-informed forecasting models with widely used epidemic priors that adapt general models to epidemiology, but find that most methods underperform a simple last-value baseline from 1 day to 1 month ahead, even during outbreaks and with these priors. We identify three major failure modes: (1) poor outbreak anticipation, (2) difficulty handling sparsity and noise, and (3) limited utility of common geographic adjacency for epidemiological spatial information. We release benchmark data, code, and instructions at https://github.com/Rachel-Lyu/SpatialEpiBench to support development of operationally useful epidemic forecasting models.
☆ Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
comment: 7 pages
☆ Process Matters more than Output for Distinguishing Humans from Machines
Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.
☆ On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR don't maintain other model knowledge except mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in RLVR-trained model behaves like heavy-tailed distribution. (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
☆ Learning to Cut: Reinforcement Learning for Benders Decomposition
Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this paper, we propose Reinforcement Learning for BD (RLBD), a framework that adaptively selects cuts using a neural network-based stochastic policy. The policy is trained using a policy gradient method via the REINFORCE algorithm. We evaluate the proposed approach on a two-stage stochastic electric vehicle charging station location problem and compare it with vanilla BD and LearnBD, a supervised learning approach that classifies cuts using a support vector machine. Numerical results demonstrate that RLBD achieves substantial improvements in computational efficiency and exhibits strong generalization to problems with similar structures but varying data inputs and decision variable dimensions.
☆ Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models ICML 2026
Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial depthwise redundancy across multiple models, suggesting iterative refinement with overlapping computations during inference stages. Guided by these insights, we design a proof-of-concept, looped single-layer model that uses only 20% of the original model's parameters while achieving comparable performance. The code is available at https://github.com/amirbalef/is_one_layer_enough.
comment: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
☆ On the Security of Research Artifacts
Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE
☆ PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
☆ Operator-Guided Invariance Learning for Continuous Reinforcement Learning
Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose \textbf{VPSD-RL} (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton--Jacobi--Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.
☆ From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.
☆ Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal-based agents. This is behaviour such as self-preservation that has been hypothesised to play a key role in risks from highly capable AI agents. Our benchmark is realistic and low-stakes which serves to reduce evaluation-awareness and roleplay confounds. The suite contains seven operational tasks, each with an official workflow and a policy-violating shortcut. An eight-variant shared framework varies monitoring, instruction clarity, stakes, permission, instrumental usefulness and blocked honest paths to support inferences regarding the factors driving IC behaviour. We evaluated ten models using deterministic environment-state scorers over 1,680 samples, with trace review employed for audit and adjudication purposes. The final IC rate is 86 out of 1,680 samples (5.1%). IC behaviour is concentrated rather than uniform: two Gemini models account for 66.3% of IC cases and three tasks account for 84.9%. Conditions in which IC behaviour is indispensable for task success result in the greatest increase in the adjusted IC rate (+15.7 percentage points), whereas emphasising that task success is critical or certain framing choices do not produce comparable effects. Our findings indicate that realistic, low-nudge environments elicit IC behaviour rarely but systematically in most tested models. We conclude that it is feasible to robustly measure tendencies for dangerous behaviour in current frontier AI agents.
☆ 3D MRI Image Pretraining via Controllable 2D Slice Navigation Task
Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientations, and scales, a 3D volume can be converted into dense video-action sequences whose controls are the action trajectories. We study this formulation with an action-conditioned pretraining objective, where a tokenizer encodes slice observations and a latent dynamics model predicts the evolution of latent features. Across representative anatomical and spatial downstream tasks, the proposed pretraining is evaluated against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. These results suggest that controllable MRI slice navigation provides a useful complementary pretraining interface for learning anatomical and spatial representations from large unlabeled MRI collections.
comment: 9 pages, 5 figures
☆ Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging-Face, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
☆ ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning
Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present \textsc{ReasonSTL}, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. \textsc{ReasonSTL} decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with \textsc{STL-Bench}, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with \textsc{ReasonSTL} achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that \textsc{ReasonSTL} provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.
☆ Patch-Effect Graph Kernels for LLM Interpretability
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods: direct-influence via causal mediation, partial-correlation, and co-influence and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI and PC selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
☆ Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features
We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93\% of patches within 5 years and 97\% within 10 years. Our approach achieves \textbf{PICP=92.6\%}, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2\%, 50 passes) and Deep Ensembles (PICP=79.7\%, 5 models) at $5\times$ lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman $ρ=0.729$), and a selective prediction about the most certain 20\% of patches can provide \textbf{0.5 years MAE}. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with $ρ=0.905$ between uncertainty and page-level error.
☆ Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^π$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
☆ Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems PAKDD 2026
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt refinements and deterministic routing guards guided by ASR diagnostics yield substantial TSR improvements, with gains up to +93.8 percentage points for previously struggling models, demonstrating that trajectory-level evaluation is essential in regulated domains.
comment: 6 pages, 2 tables. Accept at AI and Data Science for Digital Finance (AIDS4DF) Workshop, PAKDD 2026
☆ PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors
Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, $τ^2$-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and $τ^2$-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas $τ^2$-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
comment: Under Review
☆ ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization
Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the surrogate model and acquisition target are correctly specified, finite-sample Monte Carlo error can perturb acquisition values. This can, in turn, flip candidate rankings and lead to suboptimal BO decisions. As a remedy, we aim at variance reduction and propose an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate, which yields an acquisition residual orthogonal to posterior score directions and which thus reduces Monte Carlo variance. We further introduce OrthoBO: a Bayesian optimization framework that combines our orthogonal acquisition estimator with ensemble surrogates and an outer log transformation. We show theoretically that our estimator preserves the target, leads to variance reduction, and improves pairwise ranking stability. We further verify the theoretical properties of OrthoBO through numerical experiments where our framework reduces acquisition estimation variance, stabilizes candidate rankings, and achieves strong performance. We also demonstrate the downstream utility of OrthoBO in hyperparameter optimization for neural network training and fine-tuning.
☆ Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
☆ SCRuB: Social Concept Reasoning under Rubric-Based Evaluation
While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
☆ Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification IEEE
Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification-to-RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification-centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal-tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design-grounded context for generation. A multi-agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX-guided correction using trace links, and coverage-directed property augmentation. Evaluation across seven benchmark designs indicates that KG-based context retrieval improves specification-to-RTL grounding and consistently produces compilable SVAs with low syntax-repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.
comment: To appear at the IEEE International Conference on IC Design and Technology 2026 (ICICDT), June 22 - 24, 2026, Dresden, Germany
☆ E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
comment: 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework
☆ WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
☆ Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves
Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed as \textit{HilbNets}. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin \& Niyogi \cite{BELKIN20081289} convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.
comment: 51 pages, 3 figures, 5 tables
☆ Automated alignment is harder than you think
A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.
comment: 15 pages, 4 figures
☆ Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient.We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
☆ MinMax Recurrent Neural Cascades
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
☆ Rethinking Vacuity for OOD Detection in Evidential Deep Learning
Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's predictions, where $S$ is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of $K$. In particular, it is unlikely in practice that there is a linear relationship between $K$ and $S$ as $K$ and $S$ increase due to the nature of EDL (suppressing incorrectly assigned evidence). As a result, when comparing In Distribution (ID) and OOD results, it is important that $K_{\mathrm{ID}}$ and $K_{\mathrm{OOD}}$ are equal; something that is not always ensured in practice. We provide an empirical demonstration of how results for AUROC and AUPR can substantially differ when class cardinality between ID and OOD differs by 1, with AUROC differing by as much as 0.318 and AUPR by 0.613 for standard EDL, and AUROC by 0.360 and AUPR by 0.683 for IB-EDL. More concretely, our findings isolate an evaluation artefact: when K differs between ID and OOD, AUROC/AUPR can be artificially inflated without any change in model predictions. We further discuss the evaluation of EDL over causal language models using Multiple-Choice Question-Answer (MCQA) datasets and argue for clearer definitions of ID and OOD in this context. Our primary contribution is an empirical and theoretical demonstration that vacuity-based OOD detection in EDL-fine-tuned LLMs is highly sensitive to uncontrolled differences in evaluated class cardinality.
☆ Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation and mode-seeking nature of the reverse KL divergence tends to exhibit visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules -- such as GANs or reward models -- to restore visual fidelity. In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives. Code is available at https://github.com/byliutao/cdm.
comment: 22pages, 9 figures
☆ Debiased Multimodal Personality Understanding through Dual Causal Intervention
Multimodalpersonalityunderstandingplaysacriticalroleinhuman centered artificial intelligence. Previous work mainly focus on learn-ing rich multimodal representations for video personality under standing. However, they often suffer from potential harm caused by subject bias (e.g., observable age and unobservable mental states), as subjects originate from diverse demographic backgrounds. Learn ing such spurious associations between multimodal features and traits may lead to unfair personality understanding. In this work, weconstruct aStructural Causal Model (SCM)toanalyze theimpact of these biases from a causal perspective, and propose a novel Dual Causal Adjustment Network (DCAN) to mitigate the interference of subject attributes on personality understanding. Specifically, we design a Back-door Adjustment Causal Learning (BACL) module to block spurious correlations from observable demographic factors via a prototype-based confounder dictionary, and subsequently ap ply a Front-door Adjustment Causal Learning (FACL) module to ad dress latent and unobservable biases throughalearnedmediatordic tionary intervention, thereby achieving causal disentanglement of representations for deconfounded reasoning. Importantly, we con struct a Demographic-annotated Multimodal Student Personality (DMSP) dataset to support the analysis and discussion of fairness related factors. Extensive experiments on the benchmark dataset CFI-V2 and our DMSPdataset demonstrate that DCAN consistently improves prediction accuracy, reaching 92.11% and 92.90%, respec tively. Meanwhile, the improvementsinthefairnessmetricsofequal opportunity and demographic parity are 6.57% and 7.97% on CFI-V2, and 15.38% and 20.06% on the DMSP dataset. Our code and DMSP dataset are available at https://github.com/Sabrina-han/DCAN
☆ eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a classifier's latent representations during training. eX2L achieves this by penalizing the similarity between Grad-CAM activation maps generated by a primary label classifier and those from a concurrently trained confounder classifier. On the rigorous Spawrious Many-to-Many Hard Challenge benchmark, eX2L achieves an average accuracy (AA) of 82.24% +/- 3.87% and a worst-group accuracy (WGA) of 66.31% +/- 8.73%, outperforming the current state-of-the-art (SOTA) by 5.49% and 10.90%, respectively. Beyond its competitive performance, eX2L demonstrates that functional domain invariance can be achieved by explicitly decoupling label and nuisance attributes at the group level.
☆ From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions.
comment: 16 pages, 1 figure
☆ Flow Matching with Arbitrary Auxiliary Paths
We introduce a new generative modeling framework, \textbf{Flow Matching with Arbitrary Auxiliary Paths (AuxPath-FM)}, which generalizes conditional flow matching by incorporating an auxiliary variable drawn from an arbitrary distribution into the probability path. Unlike prior methods that restrict auxiliary components to Gaussian noise, AuxPath-FM allows the variable $η$ to follow any distribution, producing trajectories of the form $X_t = a(t)X_1 + b(t)X_0 + c(t)η$. We theoretically demonstrate that this construction preserves the continuity equation and maintains a training objective consistent with the marginal formulation. This flexibility enables the design of diverse probability paths using various priors, including Gaussian, Uniform, Laplace, and discrete Rademacher distributions, each offering unique geometric properties for generative flows. Furthermore, our framework allows for specialized tasks such as label-guided generation by encoding structured semantic information into the auxiliary distribution. Overall, AuxPath-FM provides a principled and general foundation for probability path design, offering both theoretical generality and practical flexibility for diverse generative modeling tasks.
☆ Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations
This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectories practical by trading additional recomputation for substantially lower memory usage. This enables full-gradient adaptive attacks against diffusion- and Langevin-based purification defenses, where prior evaluations often resort to approximate backpropagation due to memory constraints. These approximations can weaken the attack signal and risk overestimating robustness. In parallel, stochasticity in iterative purification is frequently under-controlled, even though different purification trajectories can substantially change reported robustness metrics. Building on this insight, we introduce a memory-efficient full-gradient evaluation framework for stochastic purification defenses. The framework combines checkpointed backpropagation with evaluation protocols that control stochastic variability, thereby reducing memory bottlenecks while preserving exact gradients. We evaluate diffusion-based purification and Langevin sampling with Energy-Based Models (EBMs), demonstrating that full-gradient attacks uncover vulnerabilities missed by approximate-gradient evaluations. Our framework yields stronger state-of-the-art $\ell_{\infty}$ and $\ell_{2}$ white-box attacks and further supports probing out-of-distribution robustness. Overall, our results show that exact-gradient evaluation is essential for reliable benchmarking of iterative stochastic defenses.
☆ Topological Signatures of Grokking
We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.
comment: 19 pages, 14 figures, 2 tables
☆ Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
☆ Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Systems Perspective ICML
Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, these effects are typically considered in isolation. We propose a unified perspective: humans and language models form a coupled dynamical system linked by a feedback loop of usage, generation, and retraining. We introduce a minimal model with three variables -- human cognition, data quality, and model capability -- and show that this feedback can give rise to distinct dynamical regimes. Our analysis identifies three regimes: co-evolutionary enhancement, fragile equilibrium, and degenerative convergence. Through a simple simulation, we demonstrate that increasing reliance on AI can induce a transition toward a low-diversity, suboptimal equilibrium. From an information-theoretic perspective, this transition corresponds to an emergent information bottleneck in the human-AI loop, where entropy reduction reflects loss of diversity and support under closed-loop feedback rather than beneficial compression. These results suggest that the trajectory of AI systems is shaped not only by model design, but by the dynamics of human-AI co-evolution.
comment: 5 pages, 3 figures, ICML EIML Workshop submitted
☆ Prediction and Empowerment: A Theory of Agency through Bridge Interfaces
We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfaces split between agent-controlled parameters and environment-controlled channel state, inducing a deterministic POMDP through a prior over latent microstates and many-to-one observation coarsening. Within this framework, we prove a separation between prediction, compression, and empowerment. Perfect prediction can be achieved either by identifying the hidden quotient relevant to the target family or by overwrite control that makes the future target action-determined; high empowerment alone is insufficient. Under refinable interfaces and sufficient memory, action-conditioned observation-compression progress reduces posterior uncertainty about the latent quotient, and when refinement requires steering world-side channel conditions, this creates target-conditioned interface empowerment. A bit-string specialization with a conserved information budget makes the resulting tradeoff explicit: prediction by identification requires internal capacity at least the relevant latent entropy, whereas overwrite control requires terminal action capacity over the controlled quotient. For modern AI agents, the results suggest a design principle rather than a theorem of inevitability: objectives should distinguish hidden-state identification, interface refinement, task-relevant controllability, and mere overwrite or distractor control. Human--AI alignment is partly an interface-design problem, where the relevant bridge is between human intent, agent internal state, external tools, and world-side channel conditions. This is a working draft: feedback and criticism is most welcome.
comment: This is a working draft: feedback and criticism is most welcome
☆ More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation
AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi-agent framework designed to make a researcher's implicit understanding explicit, inspectable, and actionable. InciteResearch decomposes the logical chain of Socratic questioning and distributes it across the entire pipeline that: (1) Elicits a structured five-dimensional researcher profile state anchored by specific friction points from vague, even domain-unrelated inputs; (2) Violates hidden assumptions by maximizing the feasibility-novelty product with enforcing a 7-stage causal derivation trace; and (3) check whether the proposed method is a Necessary consequence of the reframed insight. We further introduce TF-Bench, the first benchmark for tacit-to-explicit research assistance that distinguishes domain-related from domain-unrelated inspirations across four scientific modes. On TF-Bench, InciteResearch achieves leapfrogging gains over a prompt-based baseline (novelty/impact from 3.671/3.806 to 4.250/4.397), shifting generated proposals from recombination to architectural insight. Our work demonstrates that AI can serve as an extension of thinking itself, rather than merely automating downstream execution.
comment: 19 pages, 2 figures; Code is available at https://github.com/Paradoxtcal/InciteResearch.git
☆ Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models
Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, and the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train tabular foundation models; the T4 dataset represents web-scraped corpora, the TabFM dataset curated tables from Kaggle, and the TabICL dataset as the only well-used synthetic prior with publicly available parameters. We characterise each corpus using aggregate features over whole tables, columns and correlations, and compare them using discriminator AUCs and k-NN coverage metrics. We find that the TabICL synthetic prior occupies a narrow region of the space of real tables, that this mismatch cannot be closed by optimising prior hyper-parameters across more than 86 thousand configurations, and that curated and web-scraped corpora are broadly interchangeable on a distributional level in feature space. Surprisingly, the distributional gap between synthetic pre-training data and real tables has a clearly detectable effect on performance under neither feature-based proximity measures or TabICL's own internal representations, suggesting that coverage of the real-data distribution is not the primary driver of TabICL's generalisation.
☆ CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models GECCO 2026
Many real-world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to single-problem settings. In this paper, we propose CoupleEvo. CoupleEvo proposes three evolutionary coordination strategies to evolve heuristics for coupled optimization problems: the sequential strategy evolves heuristics for one subproblem after the other; the iterative strategy alternates the evolution of heuristics for different subproblems over successive generations; and the integrated strategy evolves heuristics for all problems simultaneously. The approach is evaluated on two representative coupled optimization problems. Experimental results show that decomposition-based strategies (sequential and iterative) provide more stable convergence and higher solution quality, while the integrated evolution strategy suffers from increased search complexity and variability. These findings highlight the importance of coordinating evolutionary search across interdependent subproblems and demonstrate the potential of LLM-driven heuristic design for complex coupled optimization problems. The code is available: https://github.com/tb-git-kit-research/CoupleEvo.
comment: accepted at GECCO 2026, San Jose, Costa Rica, Workshop
☆ A Regime Theory of Controller Class Selection for LLM Action Decisions
Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.
☆ TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non-iterative estimators via projection, for the classification. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi-DMR. All data and codes are available here: https://github.com/shouvik-sardar/TinyBayes
comment: 14 Pages, 1 Figure, 4 Tables
☆ Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis
Large language models (LLMs) have shown promise for event log analysis, but their high computational requirements, reliance on cloud infrastructure, and security concerns limit practical deployment. In addition, most existing approaches focus only on the identification of the problem and do not provide actionable remediation. Small language models (SLMs) present a light-weight alternative that can be fine-tuned for a specific purpose and hosted locally. This paper investigates whether SLMs, when fine-tuned for a specific task, can serve as a practical alternative for event log analysis while also generating solutions. We first create a large-scale synthetic Windows event log dataset that contains remediation actions using a high-performing LLM. We then fine-tune multiple SLMs and LLMs using the LoRA parameter-efficient fine-tuning technique and evaluate their performance by comparing with expert assessment. The results show that the dataset accurately reflects real-world scenarios and that fine-tuned SLMs consistently outperform LLMs in identifying issues and providing relevant remediation, while requiring fewer computational resources.
comment: 27 pages, 14 figures, 5 tables
☆ Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$pp ($p=0.007$) and reduces harmful compliance vs. deployment by $3.6$pp ($p=0.024$, $0/20$ items inverted) -- while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious}, with marginal eval-vs-deployment refusal effects of $-9$ to $-20$pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the $70$B model preserves direction with attenuated magnitude, ruling out a simple ``small-model effect that reverses at scale.'' One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.
☆ Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
☆ NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.
comment: 10 pages, 7 figures
☆ Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's Kronecker preconditioners: their eigenvalue spectra exhibit a \emph{spike-and-flat} shape -- a few dominant eigenvalues followed by an approximately uniform tail -- across layers and training stages, holding exactly under a rank-$ρ$ signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo's Kronecker factors to a parametric family aligned with the spike-and-flat shape: full spectral structure on a tracked $r$-dimensional subspace, single shared eigenvalue across the remaining $n-r$ directions. On these directions, we apply orthogonalization. An identity shows that this orthogonalization recovers the algebraic form of full KL-Shampoo's preconditioner. On four pre-training scales (GPT-2 124M / 350M, LLaMA 134M / 450M), Pro-KLShampoo consistently outperforms KL-Shampoo at every subspace rank we test in validation loss, peak per-GPU memory, and wallclock time to reach each loss level.
☆ Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization
Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. Across six (benchmark, reasoner) settings on MedQA-USMLE, GPQA Diamond, and MMLU-Pro with Gemini 3.1 Pro and Claude Sonnet 4.6, fusing this score with coverage and verbalized-confidence channels at K=4 yields Pareto improvements over self-consistency at K=8 in 6/6 settings (median AUC 0.78 vs 0.71, deltaAUC=+0.075). A fixed-pick control (+0.060) and E5 cross-embedder replication rule out answer switching and single-vendor artifacts. Geometry peaks in the penultimate window across benchmarks and reasoners, and inverts at the terminal window on GPQA Diamond. Three unscaffolded regimes separate black-box confidence into a judge-mediated Coverage prior (C), within-trace Geometry (G), and a conditional Verbalization channel (V). Across 18 benchmark x reasoner x proposer settings, C and G provide independent signal in 18/18 and 16/18, while V contributes residual signal in 6/18. Swapping the judge from GPT-5-mini to Claude Sonnet 4.6 leaves G-only AUC unchanged (|delta|<=0.013) and shifts C-only AUC by at most +/-0.02 (kappa=0.82). Fusion beats the best single channel in 17/18 settings (median AUC 0.78, max 0.92).
☆ Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs IEEE
Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.
comment: Accepted to 2026 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)
☆ Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
comment: 35 pages, 30 figures, 8 tables
☆ Attributions All the Way Down? The Metagame of Interpretability
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $φ(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
☆ Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
comment: 10 pages, 3 figures, 2 tables, 11 appendices
☆ Data Language Models: A New Foundation Model Class for Tabular Data
Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.
☆ Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance
When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student's t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using $γ$-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
☆ A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs
Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simulations in the literature feature settings where this yields an excellent proxy for the causal order, and show that a strict increase of relatives along the causal order leads to a singular Markov equivalence class. We propose sampling time-series DAGs as a possible alternative and discuss implications for causal discovery algorithms and their evaluation on synthetic data.
☆ Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted at 26.83%-95.18%, while down to 6.45%-59.19% in creating a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. The dynamic test cases confirm the pattern by 6.49%-48.62% pass rates. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the community of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.
comment: 35 pages, 8 figures
☆ Linear Semantic Segmentation for Low-Resource Spoken Dialects ACL
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
comment: ACL Findings 2026
☆ Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural advances, scaling, and retraining of monolithic generators. The inference-time alternative, i.e., refining the outputs of a pre-trained backbone with parameters left untouched, has remained largely unexplored for tabular synthesis. We introduce TARDIS (Tabular generation through Refinement, Distillation, and Inference-time Sampling), an inference-time refinement framework that operates on a frozen pre-trained backbone, configured per dataset by a Tree-structured Parzen Estimator search over score-level guidance during reverse diffusion, with each trial's objective set by an inner grid search over post-hoc sample selectors and an optional soft-label distillation step. The search space encodes a single mathematical pattern we name Bidirectional Chamfer Refinement (BCR): the symmetric Chamfer functional between synthetic and real samples is minimized both continuously, via a score-level gradient, and discretely, via batch-ranking post-generation. The per-dataset search recovers BCR-aligned configurations on most datasets, evidence for BCR as the dominant refinement pattern. Across 15 binary, multiclass, and regression benchmarks TARDIS achieves a median +8.6% downstream-task improvement over models trained on real data (95% CI [+3.3, +16.4], Wilcoxon p=0.016, 11/15 strict wins) and improves over the TabDiff backbone on all 15 datasets (mean +12.9%, p<10^-4), matching the backbone on manifold fidelity, diversity, and sample-level privacy. Inference-time refinement of a pre-trained tabular diffusion backbone reaches and exceeds real-data utility in 1 to 80 minutes on a single consumer-grade GPU.
☆ The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks
Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We introduce a simple identity, the Feature Learning Equation, which identifies the weight Gram matrix as the key object capturing feature dynamics. This enables us to interpret gradient descent as implicitly inducing a hypothetical evolution of features, whose covariance structure - termed the Virtual Covariance - characterizes how representations evolve during training. Building on this perspective, we introduce Target Linearity, a measure quantifying the linear alignment between features and targets. By analyzing the training and layer-wise dynamics, we show that deep networks learn to sequentially transform representations toward target-linear structure. This linearization perspective provides a unified interpretation of several empirical phenomena, including Neural Collapse and linear interpolation in generative models.
comment: 29 pages including appendix
☆ Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block $d$ decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies -- per-block, hardness-gated, and depth-scaled -- that recover current-layer separation measures without relying on backpropagated gradients. On CIFAR-10 and CIFAR-100, these remedies dramatically improve layer-separation statistics, with $4\times$--$45\times$ gains in deeper layers, while changing accuracy by less than one percentage point for non-degenerate training procedures. Tiny ImageNet provides a tougher cross-dataset check for our selected block-wise configuration and reveals the same qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments further show that architecture and augmentation choices have a larger effect on final accuracy than the training-rule modifications studied here. Cumulative free-riding is therefore a real and repairable optimization pathology. Nonetheless, for the FF training rules, architectures, and datasets we study, it is not the dominant factor limiting achievable accuracy.
☆ Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks
Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poisoning-based threats, leaving evasion-based threats underexplored. In this work, we first identify a cross-modal gradient mismatch under the multi-user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst-case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT-MC). UAT-MC tackles the challenge of unknown targeted items in evasion-based attacks (as opposed to poisoning-based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT-MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense-accuracy trade-off. Code is available at https://github.com/gmXian/UAT-MC.
☆ OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. OBLIQ-Bench exposes an overlooked asymmetry between retrieval and verification, where reasoning LLMs reliably recognize latent relevance whenever relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. We hope that OBLIQ-Bench will drive research into retrieval architectures that efficiently capture latent patterns and implicit signals in large corpora.
☆ Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence
As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present \textbf{Safactory}, a scalable agent factory for trustworthy autonomous intelligence. Safactory integrates three tightly coupled platforms: a \textbf{Parallel Simulation Platform} for trajectory generation, a \textbf{Trustworthy Data Platform} for trajectory storage and experience extraction, and an \textbf{Autonomous Evolution Platform} for asynchronous reinforcement learning and on-policy distillation. As far as we know, Safactory is the first framework to propose a unified evolutionary pipeline for next-generation trustworthy autonomous intelligence.
comment: 50 pages, 21 figures
☆ Soft Deterministic Policy Gradient with Gaussian Smoothing
Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.
comment: 25 pages, 4 figures
☆ Price of Fairness in Short-Term and Long-Term Algorithmic Selections IJCAI-26
Algorithmic decision-making in high-stakes settings can have profound impacts on individuals and populations. While much prior work studies fairness in static settings, recent results show that enforcing static fairness constraints may exacerbate long-run disparities. Motivated by this tension, we study a stylized sequential selection problem in which a decision-maker repeatedly selects individuals, affecting both immediate utility and the population distribution over time. We introduce notions of group fairness for both the short and long term and theoretically analyze the trade-off between fairness and utility via the Price of Fairness (PoF). We characterize optimal and fair policies in the short term and show that the PoF can be large even when group distributions are nearly identical. In contrast, we show that long-term disparities can vanish under simple investment policies that achieve a low PoF. We also empirically validate these theoretical observations using both synthetic and real datasets.
comment: The short version of this paper appears in the proceedings of IJCAI-26
☆ A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization
Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians with an improvement from 12%-60% and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.
comment: 32 pages, 6 figures
☆ Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control--drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject$\times$mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.
☆ Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries
Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target's attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
comment: 17 pages, 6 figures
☆ When to Trust Imagination: Adaptive Action Execution for World Action Models
World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
☆ Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization
This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons act as interactions. JC provides a unified framework for test-time aggregation that subsumes existing voting and weighted aggregation methods as special cases. Our construction of the interaction matrix leverages LLM-as-a-judge comparisons, and admits a theoretical interpretation under answer-level homogeneity assumptions. Moreover, we develop an efficient approximation strategy that makes interaction modeling practical for large-scale test-time aggregation. Experiments on math and code reasoning benchmarks show that JC consistently outperforms existing baselines across tasks, judge models, trace budgets, and trace-generation settings.
☆ TIDE: Every Layer Knows the Token Beneath the Context
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited parameters models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as improve performance across multiple language modeling and downstream tasks.
☆ FunctionalAgent: Towards end-to-end on-top functional design
Multiconfiguration pair-density functional theory (MC-PDFT) offers an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here we introduce FunctionalAgent, an agentic system for fully automated functional development. FunctionalAgent orchestrates a team of specialized sub-agents to decompose the development process into dataset construction, active-space generation, MCSCF calculation and descriptor generation, loss-function construction, and functional fitting, optimization, and evaluation, thereby linking all stages into a closed-loop automated workflow. Using FunctionalAgent, we developed MC26, a hybrid meta-GGA on-top functional that achieves improved overall accuracy on the training set compared with other methods evaluated on the same benchmark dataset. We further introduce COF26, a new functional form that, owing to the optimized training process, achieves the best performance on both the training and test sets.
☆ Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
☆ Contrastive Identification and Generation in the Limit
In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target's support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton, which encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs $\{x,y\}$ satisfying $h(x)\ne h(y)$ for an unknown target binary hypothesis $h$, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of Angluin [1980]'s tell-tale condition), a combinatorial dimension called contrastive closure dimension (a contrasitive analogue of the closure dimension in Raman et al. [2025]) and exactly characterizing uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.
☆ Super-Level-Set Regression: Conditional Quantiles via Volume Minimization
Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model's own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.
☆ Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
☆ Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
Dominant accuracy evaluation might reward unwarranted guessing of Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principle, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests, and recent NaturalBench tests without the need for gt annotation. Through systematic experiments on representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite significant progress of recent MLLMs on accuracy, logical consistency lags behind significantly. Extensive evaluations on the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed for both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.
☆ The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model's default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
comment: 28 pages, including appendices
☆ EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
comment: Preprint. 22 pages, 10 figures
☆ Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction
The work in this paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To manage the complexity of clinical documentation, we introduce a two-stage extraction framework that decomposes discharge notes, that are written in narrative form, into fine-grained, explicitly actionable clinical tasks through a staged prompting strategy. Our contributions include a systematic assessment of generative LLMs for clinical action extraction, a detailed comparison between general-purpose LLMs and task-specific supervised BERT-based models, and an analysis of annotation inconsistencies across different action categories. We show that contemporary LLMs achieve performance comparable to or exceeding supervised models on binary actionability detection, while supervised baselines retain a meaningful advantage on fine-grained multi-label category classification, despite the absence of task-specific fine-tuning and under strict data-privacy constraints. Qualitative error analysis reveals that many failures stem from misalignment between model reasoning and dataset annotation conventions, particularly in cases involving implicit clinical actions and rigid structural labeling rules. These results indicate that reported performance reflects model limitations due to lack of clinical reasoning, that is not captured by plain annotations. Labels without rationales make it impossible to distinguish clinical reasoning failures from annotation convention mismatches. Advancing clinical NLP requires reasoning-annotated datasets that document why specific spans are actionable, not merely which spans were labeled, enabling proper evaluation of model clinical understanding.
☆ OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
☆ In-Context Black-Box Optimization with Unreliable Feedback
Black-box optimization in science and engineering often comes with side information: experts, simulators, pretrained predictors, or heuristics can suggest which candidates look promising. This information can accelerate search, but it can also be biased, input-dependent, or misleading. Feedback-aware BO methods typically handle one task at a time, limiting their ability to generalize over multiple sources of feedback. In-context optimizers address cross-task adaptation, but usually assume that optimization history is the only available signal at test time. We study feedback-informed in-context black-box optimization (FICBO), where a pretrained optimizer conditions on both the observed history and cheap auxiliary feedback for the current candidate set. We introduce a structured feedback prior that models how feedback sources vary in their access, relevance, and distortion relative to the true objective, and use it to pretrain a feedback-aware transformer. At test time, the model estimates source reliability in context by comparing observed objective values with auxiliary signals, improving query selection. On synthetic and real-world tasks, FICBO effectively exploits informative feedback while remaining robust to weak or misleading sources, improving over other baselines. Empirical investigations further illustrate how the model perceives test-time sources, offering insights into its interpretability and decision-making process.
☆ Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
☆ Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
☆ BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison of different foundation models when evaluating them as deep-research agents. BioMedArena decouples six layers of biomedical agent evaluation -- benchmark loading, tool exposure, tool selection, execution mode, context management, and scoring -- and exposes 147 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool reduces to registering a few-line provider adapter. We further provide 6 agent harnesses with 6 context-management strategies, which provide 12 backbones with competitive research capabilities and significantly improved performance, achieving state-of-the-art (SOTA) results on 8 representative biomedical benchmarks, with an average lift of +15.03 percentage points over prior SOTA. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena
☆ Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation MICCAI 2026
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
comment: 10 pages, 5 figures. Submitted to MICCAI 2026
☆ Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to no explicit reasoning, with additional reasoning sometimes even degrading performance. In this work, we propose \textbf{Post-Reasoning}, a simple yet effective approach that improves instruction-tuned models by conditioning them to justify their answers after generating the final response. By design, it enables the final answer to be obtained without additional latency or token cost, while still improving performance through simple instruction augmentation. We evaluate Post-Reasoning across \(117\) model--benchmark settings spanning \(13\) open and proprietary models, \(4\) model families, and \(9\) diverse reasoning and knowledge-intensive benchmarks, including AMC, HMMT, GSM8K, GPQA, MMLU-Pro, and BIG-Bench Hard. Post-Reasoning improves performance in over \(88.19\%\) of evaluated settings, achieving a mean relative improvements of \(17.37\%\). Furthermore, we propose supervised post-reason tuning, which further improves performance in over \(91.11\%\) of evaluated settings, and exceeds the prompt-based post-reasoning baseline by an average of \(8.01\%\), demonstrating that post-reasoning can be effectively internalized through training. Ultimately, Post-Reasoning establishes a new performance ceiling for direct-answer capabilities.
☆ Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.
comment: 9 pages
☆ Entropy-Regularized Adjoint Matching for Offline RL
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} that restricts off-manifold exploration. Existing workarounds, such as appending \textit{residual} Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose \textit{Maximum Entropy Adjoint Matching} (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a \textit{Mixture Behavior Prior} that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
☆ Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models
Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs share the discreteness, but not the geometry. Their entities and relations are discrete symbols, yet their arrangement is relational and lacks a common, fixed grid. Knowledge Graphs (KGs) share the discreteness, but not the geometry. They form irregular, non-Euclidean topologies whose local neighborhoods differ from graph to graph. Therefore, Knowledge Graph Foundation Models (KGFMs) rely on identifying structural invariances to produce transferable representations. Without a universal token set, KGFMs are limited in their ability to transfer representations across unseen KGs. We close this gap by treating graphlets, small connected graphs, as structural tokens that recur in heterogeneous KGs. In this paper, We introduce a model-agnostic framework based on a vocabulary of graphlets that mines a KG between relations via pattern matching. In particular, we considered closed and open 2- and 3-path, and star graphlets, to obtain robust invariances. The framework is evaluated on 51 KGs from a wide range of domains, for zero-shot inductive and transductive link prediction. Experiments show that adding simple graphlets to the vocabulary yields models that outperform prior KGFMs.
☆ AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor--critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor--critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
comment: 22 pages, 9 figures
☆ Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow
Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.
☆ Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization
Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.
☆ AI-Generated Images: What Humans and Machines See When They Look at the Same Image
The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understandable explanations for their predictions. To this end, we develop a suite of detectors with various architectures and fine-tuning strategies, trained on our large-scale photorealistic fake image dataset, AIText2Image, and assess their performance on state-of-the-art text-to-image AI generators. We integrate 16 different explainable AI (XAI) methods into our detection framework, and the visual explanations are comprehensively refined and evaluated through a novel approach that prioritizes human understanding of AI-generated images, using both textual and visual responses collected from a survey of 100 participants. This framework offers insights into visual-language cues in fake image detection and into the clarity of XAI methods from a human perspective, measuring the alignment of XAI outputs with human preferences.
comment: Included in the main conference proceedings published by Springer Nature (CCIS Series)
☆ IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences
When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wiki-data-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.
comment: 29 pages, 8 figures
☆ SymDrift: One-Shot Generative Modeling under Symmetries
Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariances effectively, even when trained on a non-invariant empirical distribution, but they typically rely on costly multi-step sampling. Recently, drifting models have emerged as an efficient alternative, enabling single-step generation and achieving state-of-the-art performance in generative modeling tasks. However, we show that drifting models face a symmetry-specific challenge, since an equivariant generator does not generally produce the same drifting field as the one obtained from the symmetrized target distribution. Addressing this issue would require expensive symmetrization of the empirical distribution. To avoid this cost, we propose SymDrift, a framework that makes the drifting field itself symmetry-aware. We introduce two complementary strategies: (i) a symmetrized drift in coordinate space based on optimal alignment, and (ii) a $G$-invariant embedding that removes symmetry ambiguity by construction. Empirically, SymDrift outperforms existing one-shot methods on standard benchmarks for conformer and transition state generation, while remaining competitive with significantly more expensive multi-step approaches. By enabling one-shot inference, SymDrift reduces computational overhead by up to 40$\times$ compared to existing baselines, making it promising for high-throughput applications such as virtual drug screening and large-scale reaction network exploration.
☆ Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
☆ Autoregressive Visual Generation Needs a Prologue
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
☆ BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases
Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.
comment: 25 pages, 8 figures, 17 tables
☆ Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
☆ Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typically modulate a shared backbone with global prompts or degradation descriptors, or route features through predefined expert pools. However, compact global conditioning can bottleneck localized degradation evidence, while static expert routing may produce homogeneous updates or rely on unstable sparse assignments. We propose \textbf{Continuous Expert Assembly} (CEA), a token-wise dynamic parameterization framework for all-in-one image restoration. CEA employs a lightweight \textbf{Cross-Attention Hyper-Adapter} to probe intermediate spatial features and synthesize instance-conditioned low-rank routing bases and residual directions. Each spatial token then assembles its own residual update via dense signed dot-product affinities over the generated rank-wise components, avoiding external prompts, static expert banks, and discrete Top- selection. The resulting assembly rule also admits a linear-attention perspective, making its dense token-wise routing behavior transparent. Experiments on AIO-3, AIO-5, and CDD-11 show that CEA improves average restoration quality over strong prompt-, descriptor-, and expert-based baselines, with the clearest gains on spatially varying and compositional degradations, while maintaining favorable parameter, FLOP, and runtime efficiency.
☆ P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference
Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introducing \textbf{P-Guide}, a framework that achieves high-quality guidance through a single inference pass by modulating only the initial latent state. We further show that, under a first-order approximation, P-Guide is equivalent to CFG in the sense that it steers generation from the prior space, without requiring explicit velocity field extrapolation during sampling. We consider both homoscedastic and \textbf{heteroscedastic} priors, and find that jointly modeling the mean and variance enables adaptive loss attenuation and improved robustness to data uncertainty. Extensive experiments demonstrate that P-Guide reduces inference latency by approximately 50\% while maintaining fidelity and prompt alignment competitive with standard dual-pass CFG baselines.
☆ Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search over executable programs and distill insights from execution feedback to guide later iterations. Because this process moves from low-level implementations to high-level principles, we refer to it as a bottom-up paradigm. We argue that this view is incomplete and introduce a complementary top-down perspective: knowledge becomes the primary search object and code merely instantiates and tests it, making what is learned explicit and reusable across problems and trajectories. We formalize this shift through a statistical-learning view that exposes a distortion--compression trade-off, and instantiate it in both population-based and tree-based AHD frameworks. Across CO and tasks beyond it, knowledge-first search improves discovery efficiency, transfer, and generalization, often outperforming code-centric pipelines, while combining both strategies yields further gains. Our results suggest that progress in AHD depends on iteratively constructing and evolving interpretable hypotheses that retain value beyond a single search trajectory.
comment: 75 pages
☆ Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning
Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to handcrafted approaches, while achieving a comparable tradeoff to methods that require training large process reward models.
☆ CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs
Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion, which focuses on adapting models to specific cultural contexts while preserving their original behavior in other cultures. To facilitate research in this area, we introduce CrossCult-KIBench, a comprehensive evaluation benchmark for assessing both the effectiveness of knowledge insertion and its unintended side effects on non-target cultures. The benchmark includes 9,800 image-grounded cases covering 49 culturally relevant visual scenarios across English, Chinese, and Arabic language-culture groups. It supports evaluation in both single-insert and sequential-insert settings. We also propose Memory-Conditioned Knowledge Insertion (MCKI) as a baseline method. MCKI retrieves relevant cultural knowledge from an external memory using frozen MLLM representations, prepending matched entries as conditional prompts when applicable. Extensive experiments on CrossCult-KIBench reveal that current approaches struggle to balance effective cultural adaptation with behavioral preservation, highlighting a key challenge in developing culturally-aware MLLMs. Our work thus underscores an important research direction for developing more culturally adaptive and responsible MLLMs.
☆ Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking
Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event-AHU/OpenEvTracking.
☆ Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs
Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) Hierarchical Utility-Routed Data Scheduling module hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data and 2) Adaptive Utility-Calibrated Policy Optimization module dynamically scales per-task KL regularization, matching update constraints to each tasks current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.
☆ On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study \emph{constraint-driven online resource allocation for agentic workflows}. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask--model pair, the executor allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose \emph{Monte Carlo Portfolio Planning} (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget--deadline constraints.
comment: Preprint
☆ Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce \emph{Shallow Prefill, dEEp Decode} (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or construct, SPEED removes prefill tokens from the upper-layer Decode visibility set altogether. With a minimal BoS anchor, this simple change preserves broad benchmark quality while reducing long-context cost. In a controlled Llama-3.1-8B instruction-tuning study, SPEED using only 75\% of layers for prefill tokens reaches 51.2 average score on OLMES-style benchmarks, compared with 51.4 for the full-depth baseline, while improving TTFT by 33\%, TPOT by 22\%, and reducing active KV memory by 25.0\% at 128K context. Layer-wise diagnostics suggest that this cutoff retains the main prompt-selection and representation-stabilization regions of the full-depth model. These results show that long-context prompt tokens need not always persist as full-depth KV-cache objects when Decode-phase tokens remain full-depth.
☆ Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer
Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes future rewards, containing far less information than typical state or action vectors, yet it consumes the same computational budget per token. Worse, the self-attention cost of Transformers grows quadratically with sequence length, so including RTG as a separate token adds unnecessary overhead. We propose SlimDT, which removes RTG from the autoregressive sequence. Instead, we inject RTG information into the state representations before the sequential modeling step, allowing the Transformer to process only a compact (state, action) sequence. This reduces the sequence length by one-third, directly improving inference efficiency. On the D4RL benchmark, SlimDT surpasses standard DT across various tasks and achieves performance comparable to existing state-of-the-art methods. Decoupling a sparse conditioning signal from an information-rich sequence thus yields both computational gains and higher task performance.
☆ CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision
Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods already learn measurement weighting through the solver, but they still use position-only objectives. As a result, the mean estimate may improve while the reported covariance remains too small, too large, or wrong in shape. In this work, we propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. The Weighting Generation Network (WGN) predicts per-satellite reliability weights. The differentiable Gauss--Newton solver maps these weights to a position estimate and posterior covariance, and proper scoring rules supervise the East--North predictive distribution end-to-end. We study negative log-likelihood (NLL), Energy Score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in uncertainty credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes, and the mean horizontal error and 95th-percentile error improve on the deep-urban scene. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77\,m to 11.68\,m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05. The case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
comment: Submitted to NAVIGATION: Journal of the Institute of Navigation
☆ VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token level supervision. To stably integrate dense supervision with RL, we introduce a direction magnitude decoupling mechanism, where rollout level advantages computed from rewards determine update direction, while structured privileged signals modulate token level update magnitudes. This design enables semantically aligned and fine grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self supervision in improving both performance and sample efficiency for VideoLLMs.
☆ Safety Certification is Classification
The goal of this paper is certifying safety of dynamical systems subject to uncertainty. Existing approaches use trajectory data to estimate transition probabilities, and compute safety probabilities recursively via dynamic programming (DP). This recursion may lead to compounding errors in the certified safety probability, thus collapsing to a vacuous lower bound for growing horizons $T$. We propose a kernel embedding framework that treats safety certification as a classification problem on trajectory data, directly estimating the $T$-step safety probability without recursion. We show that the framework subsumes well-established approaches from the literature (e.g., barrier certificates, robust Markov models) as special cases, and allows us to go beyond their limitations. As the main consequence, it bypasses compounding error across the horizon and enables certification for systems with non-Markovian dynamics. We demonstrate that direct estimators remain stable independent of the certification horizon and in the non-Markovian setting, whilst DP-based certificates silently go unsound -- confirmed in simulation on a neural-controlled quadrotor.
comment: 32 pages, 18 figures
☆ Milestone-Guided Policy Learning for Long-Horizon Language Agents
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.
☆ VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search over system designs, and an inner loop to implement candidates, check correctness, and measure performance on the target benchmark. In the standard deployment setting, where existing stacks are highly optimized, VibeServe remains competitive with vLLM, showing that generation-time specialization need not come at the cost of performance. More interestingly, in non-standard scenarios, VibeServe outperforms existing systems by exploiting opportunities that generic systems miss in six scenarios involving non-standard model architectures, workload knowledge, and hardware-specific optimizations. Together, these results suggest a different point in the design space for infrastructure software: generation-time specialization rather than runtime generality. Code is available at https://github.com/uw-syfi/vibe-serve.
☆ Normalized Architectures are Natively 4-Bit
Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions-such as applying random Hadamard transforms and performing per-tensor scaling calculations-to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt-nvfp4
☆ Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark
Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM-predicted intervention effects, and per-factor credit traces, making causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability first-class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal-world-model PPO variant, and an architecture-matched scalar control. We propose Causal Graph-Factored Advantage PPO (CGFA-PPO) as a reference causal agent that uses SCM parents of win probability as factor-aligned critic targets with an intervention-calibration loss. All comparisons use paired seeds, paired-bootstrap confidence intervals, and Holm-Bonferroni correction within pre-registered families. Masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline; per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference-baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG-Causal-RL gives causal-RL, world-model, and LLM-agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM-grounded policy auditability.
comment: 21 pages, 8 figures, 9 tables, 1 algorithm
☆ Visual Fingerprints for LLM Generation Comparison IEEE VIS 2026
Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We present an approach to visually compare LLM outputs across generation conditions by modeling responses as collections of linguistic choices, including content, expression, and structure. We extract these choices using natural language processing pipelines and represent their distributions across repeated samples. We then visualize these distributions as visual fingerprints, enabling direct, distribution-level comparison of condition-specific tendencies. Through four usage scenarios, we demonstrate how visual fingerprints reveal consistent patterns in LLM behavior that are difficult to observe through individual responses or aggregate metrics.
comment: Submitted to the Short Paper track at IEEE VIS 2026
☆ TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a specific dataset or task typically requires either full fine-tuning, which is computationally expensive, or parameter-efficient tuning methods (PEFT) such as LoRA, which must be tailored to the internal architecture of each TFM. Furthermore, the evidence on whether weight-space fine-tuning improves accuracy or calibration is mixed \citep{tanna_exploring_2026,rubachev_finetuning_2025}. We introduce TFM-Retouche, a lightweight input-space residual adapter that is architecture-agnostic by design with respect to the frozen TFM backbone. TFM-Retouche learns a small residual correction in the input space to align the input data with the inductive biases of the pretrained model. The adapter is trained end-to-end through the frozen TFM, with a post-training identity guard that falls back to the unmodified TFM whenever adaptation does not help on held-out validation. On TabArena-Lite (51 datasets spanning binary classification, multiclass classification, and regression), TabICLv2-Retouche -- the framework instantiated on TabICLv2 -- is the top-ranked method on the leaderboard with light per-task tuning and ensembling, lifting aggregate Elo by +56 over the frozen TabICLv2 base and sitting on the Pareto front of predictive quality versus both training and inference time.
☆ Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible "paths" of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
☆ Optimal Transport for LLM Reward Modeling from Noisy Preference
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
☆ Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimensional acoustic descriptor and mapped to a four-qubit circuit with depth at most three, enabling practical quantum kernel construction under near-term constraints. We evaluate Q-Patch on an audio spoofing detection task using a controlled, balanced protocol and compare it with size-matched classical baselines. Q-Patch improves discrimination between bona fide and spoofed samples, achieving an area under the receiver operating characteristic curve (AUROC) of 0.87, compared with 0.82 for a radial basis function support vector machine (RBF-SVM) trained on the same patch-level features. Kernel-space analysis further reveals a clear class structure, with cross-class similarity around 0.615 and within-class self-similarity of 1.00. Overall, Q-Patch provides a practical framework for incorporating time-frequency-aware representations into quantum kernel learning for audio authenticity assessment in low-resource settings.
☆ When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
The extent to which Artificial Intelligence (AI) can trigger generalized paradigm shifts in science is unclear. Although some of these technologies have revolutionized data collection and analysis in specific scientific fields such as Chemistry, their overall impact depends on the scope of adoption and the ways scholars use them. In this study, we document substantial differences in the timing and extent of AI adoption across countries and scientific domains from 1960 to 2015. After 2015, we find generalized exponential growth in AI adoption, with the number of AI-supported works multiplying by at least four across all domains. The transformative nature of this rapid growth is less apparent and points to multiple challenges should adoption trends persist. According to our analyses, AI-supported research is confined to very few topics with strong ties to Computer Science and conventional statistical frameworks, suggesting limited transformational potential in epistemological terms. AI-supported works are also associated with an unwarranted citation premium and exhibit substantially higher retraction rates than non-AI-supported works across most fields. Geographically, AI adoption displays pronounced heterogeneity at the country level, along with an acceleration in the relevance of middle-income countries in Asia, from China and beyond. Thus, the transformative capacity of AI in science remains largely untapped, and its rapid adoption underlines challenges in research openness, transparency, reproducibility, and ethics from a global perspective. We discuss how best research practices could boost the benefits of AI adoption and highlight fields and geographies where these trends warrant closer scrutiny.
☆ Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large-scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture-conditional: channel-mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel-independent models (DLinear, PatchTST) are consistently degraded. In selected low-resource settings the gains are striking: TimesNet trained on only 10\% of Weather data with synthetic augmentation surpasses the full-data baseline (4 of 16 sparsity-dataset combinations). Averaged across all architectures, augmentation hurts in 67\% of trials. We further find that only the Seasonal-Trend generator reliably helps across the tested benchmarks, and that hard curriculum switching is actively harmful (+24\% MSE degradation). These results provide concrete, actionable guidelines on how to use synthetic data: use synthetic augmentation with channel-mixing architectures, use gradual annealing schedules, and treat low-resource augmentation as architecture- and dataset-dependent. Code is available at \href{https://github.com/hugoiscracked/synthetic-ts/tree/main}
☆ Pathways to AGI
Our focus are five related questions that stem from a critical software studies perspective. Underpinning this view is the acknowledged need to avoid assumptions regarding the inevitability of the current situation relating to AI. What we need to see is the closeness of the linkage between current commercial AI development and our prevailing social, political and economic circumstances. This does mean that the perspectives presented here are done so critically and conditionally. Most importantly, Artificial General Intelligence (AGI) is seen as being problematic both conceptually and definitionally. This conditioning of any view regarding AGI does lead the discussion in specific directions and to certain conclusions regarding the future. However, adopting this perspective enables the work to offer some final recommendations. We set out to ask the following questions, 1. What are the critical pathways that produced the current dominant generative AI tools (capabilities, product forms, adoption patterns)? 2. Which decision points acted as leverage nodes (small changes that had large downstream effects), and which dead ends reveal alternative possibilities that did not become dominant? 3. How do pathways differ across three foundational-model trajectories such as the frontier proprietary models, open-weight models or specific domain and sovereign models? 4. Which alternative projects branched from key leverage nodes, what is their current state, and why did some succeed, stall, fail or become absorbed? 5. Based on this analysis, what socio-technical development programmes could plausibly move toward AGI-adjacent capability while meeting requirements for transparency, moderation, wellbeing and sustainable business models?
comment: Additional data at 10.17866/rd.salford.32201874
☆ Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals IJCNN
Large Language Models (LLMs) are evolving into autonomous trading agents, yet existing benchmarks often overlook the interplay between architectural reasoning and strategy consistency. We propose Strat-LLM, a framework grounded in Stratified Strategy Alignment. Operating in a live-forward setting throughout 2025, it integrates heterogeneous data including sequential prices, real-time news, and annual reports to eliminate look-ahead bias. Extensive stress tests on A-share and U.S. markets reveal: (1) reasoning-heavy models achieve peak utility in Free Mode via internal logic, whereas standard models require Strict Mode as a vital risk anchor; (2) alignment utility is regime-dependent, with Free and Guided modes capturing momentum in uptrending markets, while Strict Mode mitigates drawdowns in downtrends; (3) mid-scale models (35B) show optimal fidelity under strict constraints, whereas ultra-large models (122B) suffer an alignment tax under rigid rules but gain a performance premium in Guided Mode; (4) standard LLMs often fall into a high win-rate trap, optimizing for small gains at the expense of total returns, which can only be mitigated through deep reasoning or strict external guardrails. Project details are available at https://Strat-LLM.github.io.
comment: Accepted by the 2026 International Joint Conference on Neural Networks (IJCNN)
☆ Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time ${O}(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.
☆ T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval
Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images and texts at the part level and introduces learnable part-query tokens that aggregate both part-specific and full-sentence context before aligning with visual part features. On top of this explicit local alignment, a bi-directional mask recovery module lets each modality reconstruct its masked content under the guidance of the other, implicitly bridging local correspondences into global feature alignment. Furthermore, we construct a new large-scale dataset called T2I-VeRW, which contains 14,668 images covering 1,796 vehicle identities with fine-grained part-level annotations. Experimental results on the T2I-VeRI dataset show that PFCVR achieves 29.2\% Rank-1 accuracy, improving over the best competing method by +3.7\% percentage points. On the newly proposed T2I-VeRW benchmark, PFCVR achieves 55.2\% Rank-1 accuracy, outperforming a comprehensive set of recent state-of-the-art methods. Source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID
☆ Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models
Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.
☆ PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue
As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas -- such as authoritative instructors, uncooperative merchants, or distracted workers -- they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating ``always-yield'' policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.
☆ A Fine-Grained Understanding of Uniform Convergence for Halfspaces
We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $Θ(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{τ\ln(1/τ)}$ at true error $τ$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.
☆ Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks ICML 2026
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.
comment: Accepted to ICML 2026
☆ iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring
Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
comment: 21 Pages, 12 figures
☆ BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine
Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi-omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General-purpose foundation models and off-the-shelf tool-augmented or multi-agent systems are not built for this: they tend to produce single-shot answers or run open-endedly, and fall short on the auditable, scenario-specific workflows that heterogeneous biomedical sources demand. This paper introduces Ingenix BioResearcher, a scenario-guided multi-agent system that maps queries to versioned research playbooks, delegates to specialized subagents over 30+ tools and machine-learning endpoints, mixes structured database access with sandboxed code for genome-scale analyses, and applies claim-level multi-model reconciliation before editorial assembly. We evaluate BioResearcher across unit-level capabilities, open-ended biomedical reasoning, and end-to-end clinical discovery. It leads evaluated baselines on 109 single-step tests (83.49% pass rate; 0.892 average score), achieves strong biomedical benchmark performance (89.33% on BixBench-Verified-50 and the top 0.758 mean score on BaisBench Scientific Discovery), and leads on a 30-query clinical end-to-end benchmark with the highest positive hit rate (74.7% $\pm$ 3.3%) and negative clear rate (96.8% $\pm$ 0.2%). These results show broad, competitive performance across unit-level, open-ended, and end-to-end clinical evaluations.
comment: 5 pages (main text), 21 pages (appendix), 8 figures, 11 tables
☆ TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering
When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as *agent drift*. We focus on two recurring failure modes *overthinking* and *overacting*, i.e., where the agent repeatedly reasons over information it already has, and where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering), to detect and mitigate agent drift in the residual stream before it surfaces as a behavioral failure. In specific, we label trajectory steps as overthinking, overacting, or calibrated, and find that their hidden states can separate linearly along two *drift axes*, pointing from calibrated behavior toward each failure mode (AUC $\approx$ 0.9). To mitigate agent drift, we project each step's activation onto these axes at test time and pull drifted ones back toward the calibrated region. Experiments show that TACT outperforms unsteered baselines across SWE-bench Verified, Terminal-Bench 2.0, and CLAW-Eval, lifting average resolve rate by $+5.8$ pp on Qwen3.5-27B and $+4.8$ pp on Gemma-4-26B-A4B-it while cutting steps-to-resolve by up to $26\%$. These gains frame agent drift as a steerable direction in the residual stream, and position TACT as a viable handle for reliable long-horizon agents.
comment: Work in progress
☆ BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning
Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation framework for DRL. Specifically, we find that regardless of attacks, backdoored policies induce consistent shifts in action distributions to ensure reliable activation, leaving detectable traces in high-quantile regions and distribution tails, even in the absence of triggers. Based on this, we design a novel metric that captures behavioral drift in action distributions to identify and suppress backdoor actions at runtime. To our knowledge, this is the first online backdoor defense that counters attacks both in single- and multi-agent DRL. Evaluated across diverse benchmarks with different backdoor attacks, BehaviorGuard consistently surpasses prior methods in both efficacy and efficiency.
comment: 11 pages
☆ PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts ICML 2026
LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function-preserving obfuscated prompts by anchoring semantics with code symbols and then using target-model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross-LLM portability, maintains target performance, and remains robust against adaptive attackers.
comment: accepted to the 43rd International Conference on Machine Learning (ICML 2026)
☆ Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
☆ Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a ``uniform credit assignment'' assumption that indiscriminately broadcast trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we initially introduce P-trace as a sample-efficient, critic-free eligibility traces method, upon which we build S-trace, implementing a sparse eligibility traces mechanism to further mitigate variance and achieve fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we contextualize the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace not only outperforms GRPO, showing gains of 0.49\% on Qwen3-1.7B and 3.16\% on Qwen3-4B, and maintaining a robust 2.98\% improvement when scaled further to Qwen3-8B in average pass@16, but notably achieves this with simultaneously higher sample and token efficiency.
☆ TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning ACL 2026
Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the highly agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.
comment: Accepted to ACL 2026
☆ From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning
Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with global prototypes. These approaches are essentially coordinate alignment, where representations of clients are forced to match the global prototypes in the embedding space in an element-wise manner. Such alignment implicitly assumes that all clients should map their representations into the feature subspace defined by the global prototypes. This assumption is reasonable in homogeneous FL, where all clients share the same feature extractor. However, it becomes problematic in HtFL, since heterogeneous feature extractors naturally induce client-specific feature subspaces, and forcing all clients to optimize within a single global subspace unnecessarily suppresses their learning capacity. We observe that coordinate alignment implicitly couples two distinct objectives: aligning inter-class semantic structure, which is directly beneficial for classification, and enforcing a shared feature basis, which is unnecessary and even harmful under model heterogeneity. Building on this insight, we design FedSAF, which shifts the alignment objective from absolute coordinates to inter-class relational structure. We demonstrate that structural alignment consistently outperforms coordinate alignment in heterogeneous settings. Experiments on multiple benchmarks show that our structural alignment outperforms state-of-the-art prototype-based HtFL methods by up to 3.52\%.
comment: 14 pages, 10 figures, 9 tables
☆ Temporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing
Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing KT methods neglect this issue, training on observed logs using standard empirical risk, which yields biased mastery estimates and accumulates errors in subsequent recommendations. To address this, we introduce a doubly robust (DR) formulation for KT that integrates a propensity model with an error imputation model, theoretically guaranteeing unbiasedness if either model is accurate. Beyond unbiasedness, in the sequential setting of KT, we identify that the estimator's performance is compromised by variance-dependent stochastic deviations that accumulate over time, thereby causing training instability and limiting performance. To mitigate this, we derive a generalization bound that explicitly characterizes the impact of estimator variance and identifies temporal smoothness as a key factor in controlling it. Building on these theoretical insights, we propose the Temporal Smoothness Doubly Robust (TSDR) framework. TSDR jointly optimizes the KT predictor and the imputation model with a smoothness regularizer, effectively reducing variance while preserving the unbiasedness guarantee of DR. Experiments on multiple real-world benchmarks demonstrate that TSDR consistently enhances various state-of-the-art KT backbones, underscoring the vital role of principled bias correction in KT.
☆ Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
☆ HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning
World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing latents: history-conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM-World (HMW), a structured world model that decomposes the latent state into a canonical (q, p) subspace and a context subspace c, while using Mamba selective state-space memory as the history-conditioned input to the same latent dynamics. Within this interface, (q, p) evolves through an energy-derived Hamiltonian vector field plus learnable residual/control dynamics, while c captures semantic, dissipative, and non-conservative factors. This gives the planner a single latent state shared by dynamics prediction, reward/value estimation, imagined rollouts, and CEM action search. On four DeepMind Control Suite tasks, HaM-World reaches the highest Avg. AUC (117.9, +9.5%), reduces long-horizon rollout error to 45% of a strong baseline model, and wins 11/12 k in {3,5,7} MSE cells. Under 12 OOD perturbations spanning dynamics shifts, action delay, and observation masking, HaM-World achieves the highest return in every condition, with average OOD-return gains of 10.2% on Finger Spin and 13.6% on Reacher Easy. Mechanism diagnostics further show bounded action-free Hamiltonian-energy drift, structured energy variation under policy rollouts, and coherent control-induced energy transfer, supporting the intended Soft-Hamiltonian dynamics design.
comment: 22 pages, 5 figures. Code: https://github.com/HaoyunT/HaM_World
☆ MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System
Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios.Existing approaches predominantly rely on model-centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain-of-thought) are often fragmented and lack a unified framework. In this paper, we propose MAS-Algorithm, a systematic multi-agent workflow for algorithmic problem solving inspired by the practices of competitive programmers and algorithm engineers. Our framework decomposes the end-to-end solving process into modular stages, enabling structured reasoning, tool integration, and flexible coordination among agents. The design emphasizes both rigor and extensibility, allowing it to generalize across diverse problem types.Experimental results on a self-constructed benchmark demonstrate consistent improvements across multiple Qwen series models, achieving an average gain of 6.48% in acceptance rate. In contrast, parameter-efficient fine-tuning on the same data yields only a marginal improvement of 0.89%. We further observe a 4.72% gain on LiveCodeBench-Pro, along with consistent improvements across additional accuracy and efficiency metrics.Beyond performance gains, we conduct comprehensive analyses to better understand the reasoning process within the workflow, including error patterns and cross-scenario behaviors. We further perform customized replacement and ablation studies to explore the upper bound of the framework, showing that individual agents can contribute improvements of up to 27.7%. These results highlight the strong potential of MAS-Algorithm for advancing AI-driven algorithmic reasoning.
☆ ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models
Although Multimodal Large Language Models (MLLMs) have achieved remarkable progress across many domains, their training on large-scale multimodal datasets raises serious privacy concerns, making effective machine unlearning increasingly necessary. However, existing benchmarks mainly focus on static or short-sequence settings, offering limited support for evaluating continual privacy deletion requests in realistic deployments. To bridge this gap, we introduce ICU-Bench, a continual multimodal unlearning benchmark built on privacy-critical document data. ICU-Bench contains 1,000 privacy-sensitive profiles from two document domains, medical reports and labor contracts, with 9,500 images, 16,000 question-answer pairs, and 100 forget tasks. Additionally, new continual unlearning metrics are introduced, facilitating a comprehensive analysis of forgetting effectiveness, historical forgetting preservation, retained utility, and stability throughout the continual unlearning process. Through extensive experiments with representative unlearning methods on ICU-Bench, we show that existing methods generally struggle in continual settings and exhibit clear limitations in balancing forgetting quality, utility preservation, and scalability over long task sequences. These findings highlight the need for multimodal unlearning methods explicitly designed for continual privacy deletion.
comment: 30 pages, 12 figures
☆ In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs
Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from participating in the global digital transformation. In this PhD proposal, we aim to address this gap, focusing on the language coverage of Linked Open Data knowledge graphs (LOD KGs). First, we identify key variables that characterize language distribution in LOD, including the number of Wikipedia articles per language edition and the number of language-tagged entities in LOD KGs. These variables are analyzed across three major multilingual LOD KGs, DBpedia, BabelNet, and Wikidata, providing insights into the representation and distribution of languages within LOD. Building on this analysis, we intend to study the impact of cross-lingual transfer candidate selection on the task of multilingual KG completion. In particular, we plan to investigate strategies based on linguistic proximity and the availability of curated annotated alignments between languages. Language proximity also motivates us to explore the benefits of analogical reasoning that relies on (dis)similarities and has not yet been investigated to identify correspondences across languages to improve KG completion performance and enhance language coverage in LOD.
☆ Which Are the Low-Resource Languages of the Semantic Web? ESWC 2026
Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to analyze the distribution of languages across LOD KGs and propose a preliminary multi-level categorization based on DBpedia, BabelNet, and Wikidata. This categorization is leveraged to bring a formal definition of low-, high-, and medium-resource languages that could be later leveraged to select cross-lingual transfer candidates.
comment: ESWC 2026 - 23rd European Semantic Web Conference, May 2026, Dubrovnik, Croatia
☆ Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and refining one's experimental goals through active system interaction. We frame this as a natural extension to sensemaking, the cognitive process of building an understanding of complex or novel data. We suggest that users enter a cycle of intentmaking (defining and updating their experiment) and sensemaking (interpreting the results) which repeats many times during the course of an investigation. Our documentation of these themes suggests an approach to designing AI tools for scientific discovery that goes beyond the existing question/answer model of many current systems, treating them as collaborative instruments rather than opaque black-box assistants.
☆ LLM-Driven Design Space Exploration of FPGA-based Accelerators EuroSys '26
Designing field-programmable gate array (FPGA)-based accelerators for modern artificial intelligence workloads requires navigating a large and complex hardware design space encompassing architectural parameters, dataflow strategies, and memory hierarchies, making the process time-consuming and resource-intensive. While the SECDA methodology enables rapid hardware-software co-design of accelerators through SystemC simulation and FPGA execution, identifying optimal accelerator configurations still requires substantial manual effort and domain expertise. This work presents SECDA-DSE, a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem, comprising tools built around SECDA to automate the design space exploration (DSE) of FPGA-based accelerators. SECDA-DSE combines a structured DSE Explorer for generating accelerator configurations with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, alongside a feedback loop that enables reinforced fine-tuning for continuous improvement. We demonstrate the feasibility of SECDA-DSE through an initial high-level synthesis based evaluation of a generated accelerator design that meets synthesis timing and resource constraints on an Zynq-7000 FPGA.
comment: Accepted to the Workshop on Intelligent System Design (InSyDe) co-located with EuroSys '26
☆ Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
Large language models (LLMs) have transformed artificial intelligence, yet classical architectures impose a fundamental constraint: every trainable parameter demands classical memory that scales unfavourably with model size. Quantum computing offers a qualitatively different pathway, but practical demonstrations on real hardware have remained elusive for models of practical relevance. Here we show that Cayley-parameterised unitary adapters -- quantum circuit blocks inserted into the frozen projection layers of pre-trained LLMs and executed on a 156-qubit IBM Quantum System Two superconducting processor -- improve the perplexity of Llama 3.1 8B, an 8-billion-parameter model in widespread use, by 1.4% with only 6,000 additional parameters and end-to-end inference validated on real Quantum Processing Unit (QPU). A systematic study on SmolLM2 (135M parameters), chosen for its tractability, reveals monotonically improving perplexity with unitary block dimension, 83% recovery of compression-induced degradation, and correct answers to questions that both classical baselines fail -- with a sharp noise-expressivity phase transition identifying the concrete path to quantum utility at larger qubit scales.
comment: 31 pages, 6 figures
☆ Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model
DNA language model aims to decipher the regulatory grammar and semantic of genomes by capturing long range dependencies in DNA sequences. Existing methods emphasize long range token interactions but often ignore the interplay between local motifs and global dependencies. In this paper, we propose Wisteria, a genomic language model that integrates multi scale feature learning within a unified framework for DNA sequence. Specifically, Wisteria augments the Mamba based architecture with gated dilated convolutions to capture local motifs and regulatory patterns, while gated multilayer perceptrons refine global dependencies. We further introduce a Fourier based attention mechanism to support frequency domain modeling, periodic extension and length generalization. Across four experimental settings with both short and long range dependencies, Wisteria demonstrates strong performance on downstream benchmarks against competitive DNA language model baselines. These results indicate that Wisteria effectively unifies local and global dependency modeling for multi scale genomic sequence analysis.
comment: 25 pages, 4 figures. Under review
☆ PREFER: Personalized Review Summarization with Online Preference Learning
Product reviews significantly influence purchasing decisions on e-commerce platforms. However, the sheer volume of reviews can overwhelm users, obscuring the information most relevant to their specific needs. Current e-commerce summarization systems typically produce generic, static summaries that fail to account for the fact that (i) different users care about different product characteristics, and (ii) these preferences may evolve with interactions. To address the challenge of unknown latent preferences, we propose an online learning framework that generates personalized summaries for each user. Our system iteratively refines its understanding of user preferences by incorporating feedback directly from the generated summaries over time. We provide a case study using the Amazon Reviews'23 dataset, showing in controlled simulations that online preference learning improves alignment with target user interests while maintaining summary quality.
☆ Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning
The core challenge of machine unlearning is to strike a balance between target knowledge removal and non-target knowledge retention. In the context of Multimodal Large Language Models (MLLMs), this challenge becomes even more pronounced, as knowledge is further divided into visual and textual modalities that are tightly intertwined. In this paper, we introduce an MLLM unlearning approach that aims to forget target visual knowledge while preserving non-target visual knowledge and all textual knowledge. Specifically, we freeze the LLM backbone and achieve unlearning by fine-tuning the visual module. First, we propose a Contrastive Visual Forgetting (CVF) mechanism to separate target visual knowledge from retained visual knowledge, guiding the representations of target visual concepts toward appropriate regions in the feature space. Second, we identify the null space associated with retained knowledge and constrain the unlearning process within this space, thereby significantly mitigating degradation in knowledge retention. Third, beyond static unlearning scenarios, we extend our approach to continual unlearning, where forgetting requests arrive sequentially. Extensive experiments across diverse benchmarks demonstrate that our approach achieves a strong balance between effective forgetting and robust knowledge retention.
comment: 20 pages, 5 figures
☆ Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers
Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels outperforming the state of the art k-nearest neighbor based identification methods by more than 7% reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time, demonstrating its robustness in realistic high-noise and attack scenarios. We compare its performance against baseline methods.
comment: 10 pages, 3 figures, 4 tables; Supplementary 5 pages with 5 figures; Including references total 18 pages
☆ VARS-FL: Validation-Aligned Client Selection for Non-IID Federated Learning in IoT Systems
Federated learning (FL) systems typically employ stateless client selection, treating each communication round independently and ignoring accumulated evidence of client contribution quality. Under non-IID data, this leads to slow convergence and unstable training, particularly when selection relies on local proxies (e.g., training loss) that are misaligned with the global optimization objective. These challenges are especially pronounced in Internet of Things (IoT) and Industrial IoT (IIoT) environments, where data is highly heterogeneous and distributed across devices observing different traffic patterns. In this paper, we propose VARS-FL (Validation-Aligned Reputation Scoring for Federated Learning), a client selection framework that quantifies each client's contribution using the reduction in server-side validation loss induced by its update. These per-round signals are aggregated into a Reputation score that combines a sliding-window average of recent contributions with a logarithmically scaled participation term, enabling robust exploration-exploitation selection. VARS-FL requires no changes to local training or aggregation and remains fully compatible with standard FedAvg. We evaluate VARS-FL on a 15-class non-IID IoT intrusion detection task using the Edge-IIoTset dataset, with 100 clients across multiple seeds, and compare it against FedAvg, Oort, and Power-of-Choice. VARS-FL consistently improves accuracy, F1-Macro, and loss, while accelerating convergence (up to 36% fewer rounds to reach 80% accuracy). These results demonstrate that validation-aligned, history-aware client selection provides a more reliable and efficient training process for federated learning in heterogeneous IoT environments.
☆ Detecting AI-Generated Videos with Spiking Neural Networks
Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On caption-paired benchmark, GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Networks (SNNs), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14\% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
☆ Logic-Regularized Verifier Elicits Reasoning from LLMs
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource-intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats theverifier as a binary latent variable, utilizinginternal activations and enforcing three logical constraints on multiple reasoning paths:negation consistency, intra-group consistency,and inter-group consistency (grouped by thefinal answer). By incorporating logical rulesas priors, LOVER can leverage unlabeled examples and is directly compatible with any offthe-shelf LLMs. Experiments on 10 datasetsdemonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier(reaching its 95% level on average). The sourcecode is publicly available at https://github.com/wangxinyufighting/llm-lover.
☆ MTL-MAD: Multi-Task Learners are Effective Medical Anomaly Detectors
Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, we propose to learn multiple self-supervised and pseudo-labeling tasks from scratch, using a joint model based on Mixture-of-Experts (MoE). By carefully integrating multiple proxy tasks, the joint model effectively learns a robust representation of normal anatomical structures, so that anomaly scores can be derived based on how well the multi-task learner (MTL) solves each task during inference. We perform comprehensive experiments on BMAD, a recent benchmark that comprises a broad range of medical image modalities. The empirical results indicate that our multi-task learner is an effective anomaly detector, outperforming all state-of-the-art competitors on BMAD. Moreover, our model produces interpretable anomaly maps, potentially helping physicians in providing more accurate diagnoses.
☆ DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation CVPR 2026
Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
comment: Accepted to CVPR 2026. Includes supplementary material
☆ Tuning Derivatives for Causal Fairness in Machine Learning
Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that predictions be independent of the protected attributes, but are overly restrictive when these attributes influence mediating variables that are considered business necessities. Recent causal formulations relax SP by distinguishing allowed from not-allowed causal paths and by complementing SP with Predictive Parity (PP), requiring the predictor to replicate the legitimate influence of business-necessities. Existing path-based definitions are mainly practical when applied to categorical attributes. This paper introduces a new framework for fairness in structural causal models that is tailored to continuous protected attributes. We formalize SP and PP through path-specific partial derivatives, establish conditions under which these criteria coincide with prior causal definitions, and characterize when a fair predictor, one that satisfies SP along not-allowed paths while achieving PP along allowed paths, exists. Building on this theory, we propose a fair tuning algorithm that either constructs such a predictor or, when not possible, allows for a trade-off between SP and PP. We present experiments on simulated and real data to evaluate our proposal, compare it with previously proposed methods, and show that it performs better when PP is considered.
☆ Agentic, Context-Aware Risk Intelligence in the Internet of Value
The Internet of Value (IoV) is a heterogeneous, partially-trusted network in which the dominant marginal risk is composite (route, sentiment, liquidity, and the policy a system is willing to commit to) rather than a property of any single chain. We argue that a risk primitive adequate for this regime is a composition of five engines: a prediction engine over price, liquidity, volatility, and route health; a Bittensor verification subnet that decentralises and economically scores prediction outputs; a sentiment-fusion engine over text, on-chain flow, and grey-literature feeds; an agentic engine under constitutional, role-bound action constraints; and an API-risk and scenario engine that converts forecasts into pre-committed action programs in the sense of Monte-Carlo scenario generation. We anchor the architecture in two empirical artefacts: a 27-hour policy-constrained liquidity stress-response experiment on Solana, and a 168-hour prediction-router calibration arc reported with explicit class-imbalance honesty. The case study supports deployability; the validator-loss decomposition is stated formally and is falsifiable.
comment: 15 pages
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
♻ ☆ REMAP: Regularized Matching and Partial Alignment of Video Embeddings
Real-world instructional videos are long, noisy, and often contain extended background segments, repeated actions, and execution variability that do not correspond to meaningful procedural steps. We propose **REMAP**, an unsupervised framework for procedure learning based on *Regularized Fused Partial Gromov-Wasserstein Optimal Transport*. REMAP relaxes balanced transport constraints, allowing non-informative or redundant frames to remain unmatched through partial transport. The formulation jointly models semantic similarity and temporal structure, while incorporating Laplacian-based smoothness and structural regularization to prevent degenerate alignments and reduce background interference. We evaluate REMAP on large-scale egocentric and third-person benchmarks. The method consistently outperforms state-of-the-art approaches, achieving up to **11.6\% (+4.45pp)** F1 and **19.6\% (+4.73pp)** IoU improvements on EgoProceL, and an average **41\% (+17.15pp)** F1 gain on ProceL and CrossTask. These results highlight the importance of partial alignment in handling real-world procedural variability and demonstrate that REMAP provides a robust and scalable approach for instructional video understanding.
comment: 9 pages, 4 figures, 6 tables
♻ ☆ Predictive and Prescriptive AI toward Optimizing Wildfire Suppression
Intense wildfire seasons require critical prioritization decisions to allocate scarce suppression resources over a dispersed geographical area. This paper develops a predictive and prescriptive approach to jointly optimize crew assignments and wildfire suppression. The problem features a discrete resource-allocation structure with endogenous wildfire demand and non-linear wildfire dynamics. We formulate an integer optimization model with crew assignments on a time-space-rest network, wildfire dynamics on a time-state network, and linking constraints between them. We develop a two-sided branch-and-price-and-cut algorithm based on: (i) a two-sided column generation scheme that generates fire suppression plans and crew routes iteratively; (ii) a new family of cuts exploiting the knapsack structure of the linking constraints; and (iii) novel branching rules to accommodate non-linear wildfire dynamics. We also propose a data-driven double machine learning approach to estimate wildfire spread as a function of covariate information and suppression efforts, mitigating observed confounding between historical crew assignments and wildfire growth. Extensive computational experiments show that the optimization algorithm scales to otherwise intractable real-world instances; and that the methodology can enhance suppression effectiveness in practice, resulting in significant reductions in area burned over a wildfire season and guiding resource sharing across wildfire jurisdictions.
♻ ☆ On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
♻ ☆ Flexible Agent Alignment with Goal Inference from Open-Ended Dialog
We introduce Open-Universe Assistance Games (OU-AGs), a formal framework extending assistance games to LLM-based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, and evolving. Current LLM agents struggle in multi-turn interactions and with maintaining accurate models of user intent in collaborative settings. Existing assistance game formulations assume fixed, predefined preferences, an assumption that breaks down in open-ended dialogue where goals are revised incrementally and expressed in natural language. Grounded in cognitive science accounts of preference construction, we represent human preferences as a dynamically updated distribution over discrete natural-language goals. To operationalize OU-AGs, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient online method that extracts and ranks candidate goals during interaction, using LLM-simulated users to perform probabilistic inference over goal hypotheses. This allows for interpretable, uncertainty-aware preference representations without large offline datasets. We evaluate GOOD across three text-based domains: grocery shopping, household robotics (AI2-THOR), and coding. Compared to baselines without explicit goal tracking, GOOD produces semantically coherent goal representations and improves alignment with user intent across domains.
comment: Previous version of the paper was titled: Open-Universe Assistance Games
♻ ☆ AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability ICML 2026
The race for artificial intelligence (AI) dominance often prioritizes scale over efficiency. Hyper-scaling is the common industry approach: larger models, more data, and as many computational resources as possible. Using more resources is a simpler path to improved AI performance. Thus, efficiency has been de-emphasized. Consequently, the need for costly computational resources has marginalized academics and smaller companies. Simultaneously, increased energy expenditure, due to growing AI use, has led to mounting environmental costs. In response to accessibility and sustainability concerns, we argue for research into, and implementation of, market-based methods that incentivize AI efficiency. We believe that incentivizing efficient operations and approaches will reduce emissions while opening new opportunities for academics and smaller companies. As a call to action, we propose a cap-and-trade system for AI. Our system provably reduces computations for AI deployment, thereby lowering emissions and monetizing efficiency to the benefit of academics and smaller companies.
comment: 22 pages, 2 figures. Accepted as a position paper at ICML 2026
♻ ☆ Multimodal Fact-Level Attribution for Verifiable Reasoning ICML 2026
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
comment: Accepted to ICML 2026. Code and data are available at https://github.com/meetdavidwan/murgat
♻ ☆ DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
comment: Project website: www.numansaeed.com/mobilefetalclip
♻ ☆ How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_θ^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).
♻ ☆ Exact Structural Abstraction and Tractability Limits
Any rigorously specified problem determines an admissible-output relation $R$, and exact correctness depends only on the induced decision quotient relation $s \sim_R s' \iff \operatorname{Adm}_R(s)=\operatorname{Adm}_R(s')$. Exact relevance certification asks which coordinates recover those classes. Decision, counting, search, approximation, PAC/regret/risk, randomized-output guarantees, anytime or finite-horizon guarantees, and distributional guarantees all reduce to this quotient-recovery problem. Universal exact-semantics reduction identifies admissible-output quotient recovery as the canonical object. Optimizer-quotient realizability is maximal, so quotient shape alone cannot mark a tractability frontier. Orbit gaps are the exact obstruction to classification by closure-law-invariant structural predicates. Exact classification by closure-law-invariant predicates succeeds exactly when the target is constant on closure orbits; on a closure-closed domain, equivalently, when the positive and negative orbit hulls are disjoint, in which case there is a least exact closure-invariant classifier. Across four natural candidate structural tractability criteria, a uniform pair-targeted affine witness produces same-orbit disagreements and rules out exact structural classification on the full binary pairwise domain. Because that witness class already sits inside the universal semantic framework, the same obstruction applies to any universal exact-certification characterization over rigorously specified problems. Restricting the domain helps only by removing orbit gaps. Without explicit margin control, arbitrarily small utility perturbations can flip relevance and sufficiency.
comment: Main PDF: 38 pages, 1 figure, 4 tables. Supplementary: 12 pages, 2 tables. Lean 4 formalization available at https://doi.org/10.5281/zenodo.19457896
♻ ☆ Refining Gelfond Rationality Principle: Towards More Comprehensive Foundational Principles for Answer Set Semantics IJCAI-2022
Non-monotonic logic programming is the basis for a declarative problem solving paradigm known as answer set programming (ASP). Departing from the seminal definition by Gelfond and Lifschitz in 1988 for simple normal logic programs, various answer set semantics have been proposed for extensions. We consider two important questions: (1) Should the minimal model property, constraint monotonicity and foundedness as defined in the literature be mandatory conditions for an answer set semantics in general? (2) If not, what other properties could be considered as general principles for answer set semantics? We address the two questions. First, it seems that the three aforementioned conditions may sometimes be too strong, and we illustrate with examples that enforcing them may exclude expected answer sets. Second, we evolve the Gelfond answer set (GAS) principles for answer set construction by refining the Gelfond's rationality principle to well-supportedness, minimality w.r.t. negation by default and minimality w.r.t. epistemic negation. The principle of well-supportedness guarantees that every answer set is constructible from if-then rules obeying a level mapping and is thus free of circular justification, while the two minimality principles ensure that the formalism minimizes knowledge both at the level of answer sets and of world views. Third, to embody the refined GAS principles, we extend the notion of well-supportedness substantially to answer sets and world views, respectively. Fourth, we define new answer set semantics in terms of the refined GAS principles. Fifth, we use the refined GAS principles as an alternative baseline to intuitively assess the existing answer set semantics. Finally, we analyze the computational complexity.
comment: 76 pages. This article is a significantly extended version of a paper presented by the authors at IJCAI-2022
♻ ☆ Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models ACL'2026
Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi-step reasoning tasks. While recent advancements on prompting and post-training have enabled LLMs to perform step-wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta-Reasoner, a new framework that empowers LLMs to "think about how to think". It optimizes the inference process by dynamically adapting reasoning strategies in real-time. Our approach employs contextual multi-armed bandits (CMABs) to learn an adaptive policy. It learns to evaluate the current state of LLM's reasoning and determine optimal strategy that is most likely to lead to a successful outcome during inference, like whether to backtrack, switch to a new approach, or restart the problem-solving process. This meta-guidance helps avoid unproductive paths exploration during inference and hence improves computational efficiency. We evaluate Meta-Reasoner on math problems (e.g., Game-of-24, TheoremQA) and scientific tasks (e.g., SciBench). Results show that our method outperform previous SOTA methods by 9-12% in accuracy, while reducing inference time by 28-35% under the same compute budget. Additional experiments on creative writing demonstrate the generalizability of our approach to diverse reasoning-intensive tasks.
comment: Accepted by ACL'2026
♻ ☆ Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation ICML'26
Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on various benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
comment: Accepted by ICML'26
♻ ☆ Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
Within the landscape of inference-time scaling methods for foundation models, a width-based approach to scaling -- which involves the insertion of tokens in the input stream to delay model responses -- offers a unique advantage by increasing model expressivity while remaining highly parallelizable at both training and inference. The existing literature on training models to utilize tokens relies on the standard cross-entropy objective in which the model output is read out and evaluated only at the final step of a pause sequence. This approach provides no mechanism for the model to regulate its own processing or to signal readiness to respond, treating the additional compute steps as a static barrier rather than a resource to be used adaptively. We propose a supervised loss, Catch Your Breath (CYB), framed as a sequential-decision problem, that trains a model to dynamically and autonomously scale the number of compute steps used for each input token. The model indicates the need for additional compute steps by emitting a special output, delaying its response via a pause. The model can abstain multiple times to obtain longer delays. Our experiments demonstrate that CYB significantly outperforms standard cross-entropy when introduced either in pretraining or fine-tuning, reducing perplexity and enhancing downstream accuracy with no additional computational or memory cost.
♻ ☆ Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding
Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content-especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
comment: This manuscript contains critical errors in the experimental parameter settings and partial algorithm derivation in Section 3 and Section 4, which will lead to inaccurate conclusion interpretation. We need to withdraw the paper for comprehensive revision, re-calculation and experimental verification, and will resubmit after full correction
♻ ☆ Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8$\sim$27 times less training time than the strongest baseline.
comment: 23 pages, 7 figures
♻ ☆ When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI
Agentic AI systems, specifically LLM-driven agents that plan, invoke tools, maintain persistent memory, and delegate tasks to peer agents via protocols such as MCP and A2A, introduce a threat surface that differs materially from standalone model inference. Agents accumulate sensitive context, hold credentials, and operate across pipelines no single party fully controls, enabling prompt injection, context exfiltration, credential theft, and inter-agent message poisoning. Current defenses operate entirely within the software stack and can be silently bypassed by a sufficiently privileged adversary such as a compromised cloud operator. Confidential computing (CC) offers a hardware-rooted alternative: Trusted Execution Environments (TEEs) isolate agent code and data from privileged system software, while remote attestation enables verifiable trust across distributed deployments. This survey synthesizes the design space in four parts: (i) a unified taxonomy of six TEE platforms (Intel SGX, Intel TDX, AMD SEV-SNP, ARM TrustZone, ARM CCA, and NVIDIA H100 CC) covering deployment roles and performance tradeoffs; (ii) an agent-centric threat model spanning perception, planning, memory, action, and coordination layers mapped to nine security goals; (iii) a comparative survey of CC-based defenses distinguishing findings that transfer from single-call inference versus what requires new agentic designs; and (iv) six open challenges including compound attestation for multi-hop agent chains and GPU-TEE performance at LLM scale. While several hardware trust primitives appear mature enough for targeted deployments, no broadly established end-to-end framework yet binds them into a coherent security substrate for production agentic AI.
♻ ☆ MineEvolve: Self-Evolution with Accumulated Knowledge for Long-Horizon Embodied Minecraft Agents
Long-horizon embodied intelligence requires agents to improve through interaction, not merely to execute plans generated from static goals. A central challenge is therefore to transform past executions into knowledge that can shape future decisions. Minecraft provides a representative testbed for this problem, where tasks such as crafting tools, building redstone components, and obtaining diamond equipment involve long prerequisite chains and are frequently disrupted by missing tools, blocked paths, GUI failures, or stagnant execution. To this end, we propose \textbf{MineEvolve}, a knowledge-driven self-evolution framework that converts execution feedback into actionable behavioral knowledge. MineEvolve first uses \underline{\emph{\textbf{\ding{182}Monitor}}} to convert each subgoal execution into typed feedback, including state changes, inventory changes, failure types, progress signals, and stagnation indicators. \underline{\emph{\textbf{\ding{183}Inducer}}} then derives reusable skills from successful executions and remedies from failed or stagnant executions. \underline{\emph{\textbf{\ding{184}Curator}}} validates, merges, filters, and retrieves these knowledge entries, while \underline{\emph{\textbf{\ding{185}Adaptor}}} uses them to repair the unfinished part of the plan under repeated failures or stagnation. Experiments on the Minecraft MCU long-horizon task suite show that MineEvolve consistently improves performance across multiple language-model planners, with larger gains on high-dependency task groups. Ablation and knowledge-accumulation studies further demonstrate that converting execution signals into structured behavioral knowledge is an effective path toward self-evolving embodied agents in long-horizon environments. Our code is available at https://github.com/xzw-ustc/MC-MineEvolve.
♻ ☆ DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which replaces fixed patchification with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence through a chunking mechanism learned end-to-end with diffusion training. DC-DiT allocates fewer tokens to predictable regions and noisy timesteps, and more tokens to detailed regions and later refinement stages, yielding meaningful spatial segmentations and timestep-adaptive compression schedules without supervision. Furthermore, the router provides an importance ordering over retained tokens, enabling elastic inference: a single checkpoint can be evaluated at flexible compute budgets with a smooth quality-compute tradeoff. Additionally, DC-DiT can be upcycled from pretrained DiT checkpoints and is also compatible with orthogonal dynamic computation approaches. On class-conditional ImageNet generation, DC-DiT reduces inference FLOPs by up to 36.8% and improves FID by up to 37.8% over DiT baselines, yielding a stronger quality--compute Pareto frontier across model scales, resolutions, and guidance settings. More broadly, these results suggest that adaptive tokenization is a general mechanism for making visual generation both more efficient and more flexible at inference time.
♻ ☆ Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
comment: Work in progress (9 pages manuscript, 3 pages references, 16 pages appendix)
♻ ☆ DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model when evaluated under equivalent conditions, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge. We publicly release our code(https://github.com/Hasebul/DeEscalWild-Benchmark-Framework) and dataset(https://doi.org/10.7910/DVN/CWMCZI).
comment: 20 pages
♻ ☆ Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regulation Agentic AI Loop for Engineering Design
The engineering design research community has studied agentic AI systems that use Large Language Model (LLM) agents to automate the engineering design process. However, these systems are prone to some of the same pathologies that plague humans. Just as human designers, LLM design agents can fixate on existing paradigms and fail to explore alternatives when solving design challenges, potentially leading to suboptimal solutions. In this work, we propose (1) a novel Self-Regulation Loop (SRL), in which the Design Agent self-regulates and explicitly monitors its own metacognition, and (2) a novel Co-Regulation Design Agentic Loop (CRDAL), in which a Metacognitive Co-Regulation Agent assists the Design Agent in metacognition to mitigate design fixation, thereby improving system performance for engineering design tasks. In the battery pack design problem examined here, we found that the novel SRL and CRDAL systems generate designs with better performance, without significantly increasing the computational cost, compared to a plain Ralph Wiggum Loop (RWL) Further, the novel CRDAL generates designs with significantly better performance than SRL. Also, we found that the CRDAL system navigated through the latent design space more effectively than both SRL and RWL. The proposed system architectures and findings of this work provide practical implications for future development of agentic AI systems for engineering design.
♻ ☆ Leviathan: Decoupling Input and Output Representations in Language Models
Modern language models use a single matrix for input embedding and output projection. This couples two distinct objectives: token representation and discrimination over a vocabulary. This work introduces Leviathan, a Transformer architecture that replaces the input embedding matrix with learned embedding vectorization (LEV), a compact continuous mapping from token indices to embeddings. Leviathan's output head remains untied for a parameter increase of as low as 0.2%. Under controlled comparisons with identical Transformer backbones, Leviathan consistently improves language modeling performance over standard tied-embedding baselines across a 200M-1.2B parameter regime on The Pile with gains that grow during training. At 1.2B scale, Leviathan reduces validation perplexity by 9%, requires $2.1\times$ fewer training tokens to reach the tied baseline's final loss, and improves on all six downstream benchmarks evaluated, including a 30% reduction in LAMBADA perplexity. Frequency-stratified analysis reveals gains to be concentrated in rare tokens, where continuous parameterization reduces perplexity by 81%, falling to near zero for the most frequent.
♻ ☆ Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
How reliably can a small language model estimate its own correctness? The answer determines whether local-to-cloud routing-escalating queries a cheap local model cannot handle-can work without supervised training data. As inference costs dominate large language model (LLM) deployment budgets, routing most queries to a cheap local model while reserving expensive cloud calls for hard cases is an increasingly common cost-control strategy. We compare zero-shot confidence signals against RouteLLM-style supervised baselines across three 7-8B model families and two datasets (1,000 and 500 queries per model, respectively). Average token log-probability, which requires no training data, matches or exceeds supervised baselines in-distribution (Area Under the Receiver Operating Characteristic curve (AUROC) 0.650-0.714 vs. 0.644-0.676) and substantially outperforms them out-of-distribution (0.717-0.833 vs. 0.512-0.564), because it measures a property of the model's generation rather than the query distribution. This paper further proposes retrieval-conditional self-assessment, a pre-generation signal that selectively injects retrieved knowledge when similarity is high, improving over bare self-assessment by up to +0.069 AUROC at 3-10x lower latency than log-probability. A supervised baseline trained on 1,000 labeled examples never exceeds the zero-shot signal. We release all code, data, and experiment logs.
♻ ☆ CatNet: Controlling the False Discovery Rate in LSTM with SHAP Feature Importance and Gaussian Mirrors
We introduce CatNet, an algorithm that effectively controls False Discovery Rate (FDR) and selects significant features in LSTM. CatNet employs the derivative of SHAP values to quantify the feature importance, and constructs a vector-formed mirror statistic for FDR control with the Gaussian Mirror algorithm. To avoid instability due to nonlinear or temporal correlations among features, we also propose a new kernel-based independence measure. CatNet performs robustly on different model settings with both simulated and real-world data, which reduces overfitting and improves interpretability of the model. Our framework that introduces SHAP for feature importance in FDR control algorithms and improves Gaussian Mirror can be naturally extended to other time-series or sequential deep learning models.
comment: Withdrawn by the authors. The main theoretical result relies on an assumption that is not valid as stated. A substantially revised and corrected work will be posted separately
♻ ☆ Screening Is Enough
A core limitation of standard softmax attention is that it does not provide an independently interpretable measure of query--key relevance: attention scores are unbounded, while attention weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected, and some attention mass is assigned even when no key is genuinely relevant. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening computes bounded query--key similarities and applies an explicit threshold, discarding irrelevant keys and aggregating the remaining keys without global competition. Across experiments, Multiscreen achieves comparable validation loss with roughly 30\% fewer parameters than a Transformer baseline and remains stable at substantially larger learning rates. It maintains stable long-context perplexity beyond the training context and shows little degradation in retrieval performance as context length increases. Finally, Multiscreen achieves lower full-context forward-pass latency at long context lengths.
comment: 36 pages, 27 figures. Revised version with retuned Transformer baselines, additional experiments, ablations, and appendix analyses
♻ ☆ P^2O: Joint Policy and Prompt Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on ``hard samples'' where all rollouts fail. This lack of variance eliminates crucial learning signals. For these intractable samples, simply scaling up rollout budgets offers limited gains. We introduce Joint Policy and Prompt Optimization (P$^2$O) to mitigate this collapse by alternating continuous policy updates with discrete prompt evolution. P$^2$O leverages the GEPA algorithm to discover successful reasoning prompts for intractable instances. Via context distillation, the model internalizes these prompt-induced gains directly into its parameters, removing the need for inference-time prompting. Empirically, P$^2$O restores critical advantage signals, significantly outperforming standard GRPO and surpassing baselines with doubled rollout budgets, ultimately yielding strong out-of-distribution generalization and an up to $9.5\%$ performance improvement. Our findings expose the limits of standard exploration in sparse-reward environments, illuminating the potential of unifying evolutionary algorithms with reinforcement learning. This integration of discrete semantic search and continuous parameter updates establishes a self-reinforcing paradigm for autonomous LLM alignment.
♻ ☆ Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training. In this paper, we propose using Low-Rank Adaptation (LoRA) as a structural regularizer for critic learning. Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. We evaluate our method across different off-policy RL algorithms, including SAC and FastTD3 based on different network architectures. Empirically, LoRA efficiently reduces critic loss during training and improves overall policy performance, achieving the best or competitive results on most tasks. Extensive experiments demonstrate that our low-rank updates provide a simple and effective form of structural regularization for critic learning in off-policy RL.
♻ ☆ LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance
Dataset license compliance is a critical yet complex aspect of developing commercial AI products, particularly with the increasing use of publicly available datasets. Ambiguities in dataset licenses pose significant legal risks, making it challenging even for software IP lawyers to accurately interpret rights and obligations. In this paper, we introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We first evaluate existing legal FMs (i.e., FMs specialized in understanding and processing legal texts) and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. LicenseGPT, fine-tuned on a curated dataset of 500 licenses annotated by legal experts, significantly improves PA to 64.30%, outperforming both legal and general-purpose FMs. Through an A/B test and user study with software IP lawyers, we demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy. Software IP lawyers perceive LicenseGPT as a valuable supplementary tool that enhances efficiency while acknowledging the need for human oversight in complex cases. Our work underscores the potential of specialized AI tools in legal practice and offers a publicly available resource for practitioners and researchers.
♻ ☆ Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall -- projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes -- and a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations -- removable below chance with a single rank-one intervention per depth at no measurable capability cost.
♻ ☆ Fast and Efficient Gossip Algorithms for Robust and Non-smooth Decentralized Learning
Decentralized learning on resource-constrained edge devices demands algorithms that are communication-efficient, robust to data corruption, and lightweight in memory. State-of-the-art gossip-based methods address communication efficiency, but achieving robustness remains challenging. Methods for robust estimation and optimization typically rely on non-smooth objectives (\textit{e.g.}, pinball loss, $\ell_1$ loss), yet standard gossip methods are primarily designed for smooth losses. Asynchronous decentralized ADMM-based methods have been proposed to handle such non-smooth objectives; however, existing approaches require memory that scales with node degree, making them impractical when memory is limited. We propose AsylADMM, a novel asynchronous gossip algorithm for decentralized non-smooth optimization requiring only two variables per node. We provide a new theoretical analysis for the synchronous variant and leverage it to prove convergence of AsylADMM in a simplified setting based on the squared loss. Empirically, AsylADMM converges faster than existing baselines on challenging non-smooth problems, including quantile and geometric median estimation, lasso regression, and robust regression. More broadly, our novel gossip framework opens a practical pathway toward robust and non-smooth decentralized learning.
♻ ☆ It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles, namely mutual alignment, unlocking and racing, that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
♻ ☆ DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning
Federated Learning (FL) enables collaborative model training across distributed devices while preserving local data privacy, making it ideal for mobile and embedded systems. However, the decentralized nature of FL also opens vulnerabilities to model poisoning attacks, particularly backdoor attacks, where adversaries implant trigger patterns to manipulate model predictions. In this paper, we propose DeTrigger, a scalable and efficient backdoor-robust federated learning framework that leverages insights from adversarial attack methodologies. By employing gradient analysis with temperature scaling, DeTrigger detects and isolates backdoor triggers, allowing for precise model weight pruning of backdoor activations without sacrificing benign model knowledge. Extensive evaluations across four widely used datasets demonstrate that DeTrigger achieves up to 251x faster detection than traditional methods and mitigates backdoor attacks by up to 98.9%, with minimal impact on global model accuracy. Our findings establish DeTrigger as a robust and scalable solution to protect federated learning environments against sophisticated backdoor threats.
comment: 21 pages
♻ ☆ Risk Horizons: Structured Hypothesis Spaces for Longitudinal Clinical Prediction
Predicting future clinical events from longitudinal electronic health records (EHRs) requires selecting plausible outcomes from a large and structured event space under sparse observations. While clinical coding systems provide hierarchical organization of events, cross-modal and temporal relationships are not explicitly specified and must instead be inferred from data, making prediction difficult for weakly observed longitudinal transitions. We introduce Risk Horizons, a geometry-aware framework for constructing patient-specific candidate spaces for multi-modal next-visit prediction. Risk Horizons combines deterministic coding hierarchies with data-driven lagged cross-modal associations, embeds the resulting clinical graph in hyperbolic space, and retrieves candidate futures using directional risk cones. This reframes longitudinal prediction as ranking within a compact, clinically coherent hypothesis space rather than scoring an unconstrained vocabulary. Experiments on MIMIC-IV and eICU demonstrate competitive next-visit prediction performance, with consistently improved hierarchy consistency across diagnoses, procedures, and medications. Further analysis suggests that hyperbolic structured candidate retrieval is the primary driver of performance, while LLMs are effective as constrained inference-time rerankers operating over clinically grounded candidate sets.
♻ ☆ Path Dependence under Adaptive AI Delegation
Repeated AI assistance can improve immediate task performance while reducing the skill available for future independent work. We develop a mathematical framework for this long-run tradeoff. The model tracks two state variables: a latent human skill level governing expected independent performance, and a delegation level representing the learner's evolving tendency to rely on AI. Skill changes through error-driven learning under practice and decay under delegation; delegation responds to observed performance, increasing when AI-assisted work appears to outperform independent work. We analyze the resulting dynamics and contrast them with fixed delegation. With fixed delegation, skill follows a one-dimensional learning-decay process with a single stable equilibrium. With adaptive delegation, the coupled system has two attracting equilibria separated by the stable manifold of an interior saddle. The existence and geometry of this separatrix require a global phase-plane analysis of the coupled dynamics. The system is path-dependent: small differences in initial skill or reliance can lead to different long-run outcomes. We use this characterization to show that AI assistance can improve short-run performance while producing worse long-run performance than a no-AI baseline. Increasing AI capability can enlarge the basin of attraction of the low-skill equilibrium, making delegation appear beneficial for longer while increasing the risk of eventual skill loss. The qualitative picture is observed to persist across alternative specifications. Together, these results show that the risk is not AI assistance itself, but the coupling between performance-driven reliance and use-dependent skill change.
♻ ☆ MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation
Automated single-cell annotation is difficult when the most abundant genes are not the most discriminative ones, or when a target state is poorly covered by a fixed reference atlas. GPTCelltype-style one-shot prompting allows large language models (LLMs) to produce plausible labels from generic expression signals, while reference-based annotators can force unfamiliar states into the nearest known category. We propose MAT-Cell, a prompt-driven framework for batch-level single-cell annotation that separates evidence grounding from label decision. MAT-Cell first uses Reverse Verification Query (RVQ) to combine tissue context, observed differentially expressed genes, and LLM-elicited biological priors into structured candidate-specific premises. Verifier agents then convert these premises into explicit premise-to-claim reasoning trees, and bounded multi-round debate compares,challenges, and revises the resulting claims before consensus or final adjudication.The returned Syllogistic Derivation Tree (SDT) provides an auditable debate trace rather than a formal proof of the annotation. In open-candidate benchmarks across five datasets, a locally deployed Qwen3-30B model with MAT-Cell achieves 75.5% average accuracy, compared with 64.2% for the strongest evaluated CoT baseline and 51.9% for the strongest evaluated scPilot variant. In oracle-candidate bench-marks across three species,MAT-Cell remains competitive across backbones, and local inference substantially reduces monetary cost for batch annotation. Code is available at: https://anonymous.4open.science/r/MATCell-4067
♻ ☆ Multivariate Standardized Residuals for Conformal Prediction
While split conformal prediction guarantees marginal coverage, approaching the stronger property of conditional coverage is essential for reliable uncertainty quantification. Naive conformal scores, however, suffer from poor conditional coverage in heteroskedastic settings. In univariate regression, this is commonly addressed by normalizing non-conformity scores using an estimated local score variance. In this work, we propose a natural extension of this normalization to the multivariate setting, effectively whitening the residuals to decouple output correlations and standardize local variance. Furthermore, we derive a sufficient condition characterizing a broad class of distributions for which standardized residuals yield asymptotic conditional coverage. We demonstrate that using the Mahalanobis distance induced by a learned local covariance as a non-conformity score provides a closed-form, computationally efficient mechanism for capturing inter-output correlations and heteroskedasticity, avoiding the expensive sampling required by previous methods based on cumulative distribution functions. This structure unlocks several practical extensions, including the handling of missing output values, the refinement of conformal sets when partial information is revealed, and the construction of valid conformal sets for transformations of the output. Finally, we provide extensive empirical evidence on both synthetic and real-world datasets showing that our approach yields conformal sets that improve upon the conditional coverage of existing multivariate baselines.
♻ ☆ MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
♻ ☆ Unsupervised Anomaly Detection in Wearable Foot Sensor Data: A Baseline Feasibility Study Towards Diabetic Foot Ulcer Prevention
Diabetic foot ulcers (DFUs) are a severe complication of diabetes associated with significant morbidity, amputation risk, and healthcare burden. Developing effective continuous monitoring frameworks requires first establishing reliable baseline models of normal foot biomechanics. This paper presents a feasibility study of an anomaly detection framework applied to time-series data from wearable foot sensors, specifically NTC thin-film thermocouples for temperature and FlexiForce A401 pressure sensors for plantar load monitoring. Data were collected from healthy adult subjects across 312 capture sessions on an instrumented pathway, generating 93,790 valid multi-sensor readings spanning September 2023 to June 2024. Two unsupervised algorithms, Isolation Forest and K-Nearest Neighbors using Local Outlier Factor (KNN/LOF), were applied to detect statistical deviations in foot temperature and pressure signals. Results show that Isolation Forest is more sensitive to subtle, distributed anomalies, while KNN/LOF identifies concentrated extreme deviations but flags a higher proportion of sessions not corroborated by Isolation Forest. Since no clinical ground truth is available, this difference is interpreted as lower specificity under the shared 5 percent contamination assumption rather than a confirmed false-positive rate. A mild positive correlation (0.41-0.48) between pressure and temperature features supports the case for combined multi-modal monitoring. These findings establish a validated baseline analytical pipeline and provide a methodological foundation for future clinical validation studies involving diabetic patients, where the relationship between detected anomalies and DFU-related pathophysiology can be directly assessed.
comment: 36 pages, 19 figures. Published in Biomedical Signal Processing and Control, Vol. 123, Part A, 110416, September 2026. https://doi.org/10.1016/j.bspc.2026.110416
♻ ☆ DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis
Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequencing data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, hindering the utility of ML models. To address these issues, we propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on purely data-driven heuristics, DOGMA provides a prior-guided graph construction pipeline that integrates statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA exhibits strong robustness in strict zero-shot cell-type evaluation and sample efficiency while using substantially lower GPU memory and inference time in downstream evaluation.
comment: 34 pages, 4 figures
♻ ☆ Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce reductions from AI-Control Games to a special case of zero-sum partially observable stochastic games that allow us to leverage existing algorithms to find Pareto-optimal protocols. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. To demonstrate the utility of our formalism, we show improvements over empirical studies in existing settings, evaluate protocols in new settings, and analyse how modelling assumptions affect the safety and usefulness of protocols. Finally, we leverage our formalism to precisely describe some of the implicit assumptions in prior control work.
♻ ☆ Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded. The code is available on github: https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT
♻ ☆ Information Aggregation with AI Agents
Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from similar limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and they are more profitable. Surprisingly, giving them feedback about past performance has no impact on aggregation.
comment: 64 pages
♻ ☆ FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. It can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. Specifically, we modify and extend verl-tool, resulting in a new framework that we call verl-tool-future. Unlike standard RL training frameworks that rely on immediate rewards, verl-tool-future stores prediction-time rollouts, backfills rewards after real-world outcomes become available, and then replays the completed trajectories for policy update. Across three open-source agents, successive FutureWorld training rounds lead to consistent improvements in prediction accuracy, probabilistic scoring, and calibration, demonstrating that delayed real-world outcome feedback can serve as an effective RL signal for predictive agents.
comment: We will release the code in the near future
♻ ☆ Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
Hallucination in large language models (LLMs) during long-form generation remains difficult to address under existing reinforcement learning from human feedback (RLHF) frameworks, as their preference rewards often overlook the model's own knowledge boundaries. In this paper, we propose the $\textbf{K}$nowledge-$\textbf{L}$evel $\textbf{C}$onsistency Reinforcement Learning $\textbf{F}$ramework ($\textbf{KLCF}$), which re-examines this problem from a distribution alignment perspective. KLCF formalizes long-form factuality as a bidirectional distribution matching objective between the policy model's expressed knowledge distribution and the base model's parametric knowledge distribution: under the constraint that generation must not exceed the support set of the base knowledge, the objective maximizes coverage of high-probability facts, thereby jointly optimizing precision and recall. To achieve this, we design a Dual-Fact Alignment mechanism that approximates the recall term using a factual checklist constructed by sampling from the base model, and constrains hallucinations with a lightweight truthfulness reward model. Both components are jointly optimized and require no external retrieval throughout training. Experimental results demonstrate that KLCF consistently improves factuality metrics across multiple long-form benchmarks and model scales, effectively alleviating hallucination and over-conservatism while maintaining efficiency and scalability.
comment: 32 pages
♻ ☆ RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners ACL 2026
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families-Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B)-RSAT improves faithfulness 3.7$\times$ over SFT alone (0.224$\rightarrow$0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
comment: 8 pages, 8 tables, 9 figures, and a 3-page Appendix. Accepted at the SURGeLLM Workshop at ACL 2026 and will be included in the proceedings
♻ ☆ Beyond Value Elicitation: Towards Moral Profiles in Early Requirements Engineering via Role-Playing Games and Anthropologist LLMs
This study presents a proof of concept for eliciting and representing the moral profiles of digital system users in Requirements Engineering (RE) by combining immersive role-playing games (RPGs) with large language model (LLM) analysis. While existing approaches rely on predefined value taxonomies and explicit articulation, values are often tacit, context-dependent, and difficult to express directly. To address these limitations, we propose moving from the elicitation of discrete moral values to the narrative reconstruction and representation of users' moral profiles. Grounded in phenomenological and narrative anthropology, the approach focuses on capturing users' moral orientations as they emerge through situated decision-making. RPG sessions generate context-rich narrative data, which are then analyzed by a specialized LLM (GPT-A) to produce individual anthropological moral profiles (IAMPs). A validation process based on cross-comparison between model outputs and participants' responses in unseen moral scenarios assesses the adequacy of the generated representations. Results indicate that RPG environments effectively support the generation of rich, context-dependent data for eliciting tacit values, and that an anthropologically grounded LLM can transform such data into coherent narrative representations of users' moral profiles. These representations enable the contextual interpretation of users' preferences and values within the given domain, with improved performance when interpretive framing captures relationships between actions, underlying motivations, and individual domain expertise. From an RE perspective, this approach enables the analysis of user preferences and trade-offs while preserving their situated and dynamic nature, providing a foundation for integrating human moral values into the early stages of RE.
♻ ☆ Making AI Evaluation Deployment Relevant Through Context Specification
With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches often mask the operational realities that ultimately determine deployment success, making it difficult for organizational decision makers to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform this decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.
comment: 8 pages; 2 figures
♻ ☆ Tracking Capabilities for Safer Agents
AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.
♻ ☆ Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing these gaps with 300 human-verified tasks spanning 9 categories across three groups: general service orchestration, multimodal perception and interaction, and multi-turn professional dialogue. To enable trajectory-aware grading, each run is recorded through three independent evidence channels: execution traces, audit logs, and environment snapshots, yielding 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, with Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models show that: (1) Trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures detected by our framework. (2) Capability does not imply consistency, with Pass@3 remaining stable under error injection while Pass^3 dropping by up to 24 percentage points. (3) Agent capability is strongly multi-dimensional, with model rankings varying across task groups and metrics, indicating that our heterogeneous evaluation coverage is essential. Claw-Eval highlights directions for developing agents that are not only capable but reliably deployable.
♻ ☆ BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics ICML 2026
This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.
comment: Accepted at ICML 2026
♻ ☆ The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
Text-to-image diffusion models (DMs) are frequently abused to produce harmful or copyrighted content, violating public interests. Concept erasure (unlearning) is a promising paradigm to alleviate this issue. However, there exists a peculiar forgetting illusion phenomenon with unclear cause. Based on empirical analysis, we formally explain this cause: most unlearning partially disrupt the mapping between linguistic symbols and the underlying internal knowledge, leaving the knowledge intact as dormant memories. We further demonstrate that distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting unlearning strength. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a novel attack framework designed to assess the robustness of current unlearning methods. IVO optimizes initial latent variables to realign the noise distribution of unlearned models with that of their vanilla counterparts, which reconstructs the fractured mappings and consequently revives dormant memories. Extensive experiments covering 11 unlearning techniques and 3 concept scenarios show that IVO outperforms state-of-the-art baselines, exposing fundamental flaws in current unlearning mechanisms. Warning: This paper has unsafe images that may offend some readers.
comment: 25 pages, 12 figures, 12 tables
♻ ☆ AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA outperforms existing methods across both simulation and real-world evaluations. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
♻ ☆ Auction-Based Regulation for Artificial Intelligence
In an era of "moving fast and breaking things", regulators have moved slowly to pick up the safety, bias, and legal debris left in the wake of broken Artificial Intelligence (AI) deployment. While there is much-warranted discussion about how to address the safety, bias, and legal woes of state-of-the-art AI models, rigorous and realistic mathematical frameworks to regulate AI are lacking. Our paper addresses this challenge, proposing an auction-based regulatory mechanism that provably incentivizes agents (i) to deploy compliant models and (ii) to participate in the regulation process. We formulate AI regulation as an all-pay auction where enterprises submit models for approval. The regulator enforces compliance thresholds and further rewards models exhibiting higher compliance than their peers. We derive Nash Equilibria demonstrating that rational agents will submit models exceeding the prescribed compliance threshold. Empirical results show that our regulatory auction boosts compliance rates by 20% and participation rates by 15% compared to baseline regulatory mechanisms, outperforming simpler frameworks that merely impose minimum compliance standards.
comment: 26 pages, 7 figures, 3 tables. Accepted at ACM FAccT 2026
♻ ☆ Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
Long-context Large Language Models, despite their expanded capacity, require careful working memory management to mitigate attention dilution during long-horizon tasks. Yet existing approaches rely on external mechanisms that lack awareness of the agent's reasoning state, leading to suboptimal decisions. We propose Memory-as-Action (MemAct), a framework that treats working memory management as learnable policy actions. By formulating context management as in-place editing operations (deletion, insertion), MemAct enables joint optimization of information retention and task performance through end-to-end reinforcement learning. To address the computational challenges of dynamic context updates, we introduce Dynamic Context Policy Optimization, which restores training efficiency without compromising reasoning integrity. Experiments show that MemAct-RL-14B matches the accuracy of models $16\times$ larger while reducing average context length by 51\%, with learned strategies that adapt to model capabilities and generalize across task complexities.
♻ ☆ Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning
Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile to inference-time deviations from states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert Chain-of-Thoughts. Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.
♻ ☆ From Documents to Spans: Scalable Supervision for Evidence-Based ICD Coding with LLMs
International Classification of Diseases (ICD) coding assigns diagnosis codes to clinical documents and is essential for healthcare billing and clinical analysis. Reliable coding requires that each predicted code be supported by explicit textual evidence. However, existing public datasets provide only code labels, without evidence annotations, limiting models' ability to learn evidence-grounded predictions. In this work, we argue that dense, document-level evidence annotation is not always necessary for learning evidence-based coding. Instead, models can learn code-specific evidence patterns from local spans and use these patterns to support document-level evidence-based coding. Based on this insight, we propose Span-Centric Learning (SCL), a training framework that strengthens LLMs' coding ability at the span level and transfers this capability to full clinical documents. Specifically, we use a small set of annotated documents to supervise evidence recognition, aggregation, and code assignment, while leveraging a large collection of lightweight evidence spans to reinforce span-level reasoning. Due to their compactness, span annotations are scalable and can be further augmented through synthesis. Under the same Llama3.1-8B backbone, our approach achieves an 8.2-point improvement in macro-F1 at only 20% of the training cost of standard SFT, and provides explicit supporting evidence for each predicted code, enabling human auditing and revision.
♻ ☆ Autogenesis: A Self-Evolving Agent Protocol
Recent advances in LLM based agent systems have shown promise in tackling complex, long horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under specify cross entity lifecycle and context management, version tracking, and evolution safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol registered resources with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution. The code is available at https://github.com/DVampire/Autogenesis.
♻ ☆ ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
Recent studies have begun to explore proactive large language model (LLM) agents that provide unobtrusive assistance by automatically leveraging contextual information, such as in code editing and in-app suggestions. However, most focus on short, task-specific episodes or on-screen contexts, rather than continuously perceiving and assisting users throughout daily life. Enabling such in-the-wild assistance requires continuous sensing of users' surroundings, which can incur substantial system overhead. In this work, we propose ProAgent, an end-to-end proactive agent system that harnesses on-demand sensory contexts to provide in-the-wild assistance. ProAgent first employs on-demand tiered perception to continuously sense users' surroundings by integrating low-cost contextual cues with richer perception on demand, and uses proactive-oriented context extraction to derive hierarchical contexts integrating both sensory contexts and human preferences. ProAgent then employs a context-aware proactive reasoner to infer user needs and invokes external tools to deliver proactive assistance. We implement ProAgent on AR glasses and evaluate it on a public dataset and a real-world dataset. Results demonstrate that ProAgent achieves up to 27.7% higher proactive prediction accuracy and 20.5% lower false detection than state-of-the-art baselines. A user study with 20 participants shows that 85% were satisfied with ProAgent and willing to use it in daily life.
♻ ☆ Disentangled Generative Graph Representation Learning
Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.
♻ ☆ FIT to Forget: Robust Continual Unlearning for Large Language Models
While large language models (LLMs) exhibit remarkable capabilities, they increasingly face demands to unlearn memorized privacy-sensitive, copyrighted, or harmful content. Existing unlearning methods primarily focus on \emph{single-shot} scenarios, whereas real-world deletion requests arrive \emph{continually}. Naïvely applying these methods to sequential requests leads to severe utility degradation and catastrophic forgetting. To address this, we propose \fit, a robust continual unlearning framework to process high-volume sequential deletion streams while resisting both catastrophic forgetting and post-unlearning recovery. \fit stabilizes sequential updates through three synergistic mechanisms: redundancy \underline{F}iltering, \underline{I}mportance-aware adaptive algorithm selection, and \underline{T}argeted layer attribution. Furthermore, to facilitate rigorous evaluation, we introduce \textbf{PCH}, a unified benchmark encompassing \textbf{P}ersonal, \textbf{C}opyrighted, and \textbf{H}armful content, alongside two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), to systematically quantify forgetting-utility trade-offs. Extensive experiments across five LLMs (up to 14B parameters) demonstrate that \fit consistently achieves state-of-the-art unlearning efficacy and utility preservation. Notably, even after hundreds of sequential requests, \fit preserves strong downstream (\eg, GSM8K, MMLU) performance and exhibits superior resilience against relearning and quantization recovery attacks.
comment: 26 Pages
♻ ☆ MetaKE: Meta-Learning for Knowledge Editing Toward a Better Accuracy-Editability Trade-off
Existing locate-then-edit Knowledge Editing (KE) methods typically decompose editing into two stages: upstream target representation optimization and downstream constrained parameter optimization. The optimization across the two stages is disconnected: upstream applies uniform regularization without observing downstream realization of the planned residual, hindering a refined accuracy-editability trade-off. Since this realization is request-specific and depends on downstream constraints, uniform regularization can over-shrink high-association requests, causing insufficient editing, while it can under-regularize low-association requests, producing over-large planned residuals that reduce downstream editability. To bridge this disconnect, we propose MetaKE (Meta-learning for Knowledge Editing), a new framework that unifies upstream and downstream stages into a bi-level optimization problem. The inner level optimizes parameter updates for the target representation, while the outer level optimizes representation using feedback from downstream constraints, achieving a better semantic accuracy-editability trade-off. To avoid costly multi-layer backpropagation, we introduce a Structural Gradient Proxy to approximate and propagate this feedback. Extensive experiments show that MetaKE outperforms strong baselines, offering a new perspective on KE.
comment: 37 pages, 9 figures
♻ ☆ A Theoretical Analysis of Test-Driven Code Generation
Code assistants are increasingly utilized in test-driven software development, yet the theoretical mechanisms behind their environment-interaction strategies remain underexplored. We provide a probabilistic framework for two dominant paradigms: code selection after generation using the execution environment, and code generation conditioned on environment feedback. First, we formalize several well-established selection heuristics as environment-aware estimators of code correctness. We theoretically prove that estimators based on fuzzy functional similarity add an inductive bias and strictly dominate estimators based on functional equivalence in terms of signal-to-noise ratio. Second, we frame backprompting as an in-context approximation of Thompson sampling. We derive a novel regret bound for reward functions with unobservable components, theoretically explaining why the effectiveness of backprompting is limited by the ambiguity of the informal task description (an irreducible regret). Using five state-of-the-art open weight models, we corroborate these findings across BigCodeBenchHard, LeetCodeDataset, and QiskitHumanEvalSim. Our formalization also suggests how to improve task descriptions effectively, leading to a new benchmark, QiskitHumanEvalSimX.
comment: preprint
♻ ☆ ReMAP: Neural Reparameterization for Scalable MAP Inference in Arbitrary-Order Markov Random Fields
Scalable high-quality MAP inference in arbitrary-order Markov Random Fields (MRFs) remains challenging. Approximate message-passing methods are often efficient but can degrade on dense or high-order instances, while exact solvers such as Toulbar2 become increasingly expensive at scale. We present ReMAP, an instance-wise neural reparameterization framework that directly optimizes a differentiable relaxation of the original MRF energy. Instead of relying on supervised labels or amortized training, ReMAP treats each MRF as an independent optimization problem: a Graph Neural Network produces node-wise label distributions, and gradient-based optimization searches for a low-energy discrete solution in an over-parameterized continuous space. The method supports pairwise and arbitrary-order factors, heterogeneous label cardinalities, and efficient GPU execution, without requiring labeled solutions. We show that the relaxed objective is consistent with the discrete MAP problem and analyze how neural over-parameterization can expose low-energy optimization paths unavailable in the original discrete space. Empirically, on synthetic pairwise and high-order MRFs, UAI 2022 inference benchmarks, and real-world Physical Cell Identity (PCI) problems, ReMAP consistently outperforms approximate baselines and often finds lower-energy solutions than Toulbar2 on hard large-scale instances within practical time budgets.
♻ ☆ Learning Lifted Action Models from Unsupervised Visual Traces ICAPS-26
Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.
comment: Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)
♻ ☆ CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
♻ ☆ AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.
comment: 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released
♻ ☆ How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication does not merely increase resource consumption but systematically redistributes environmental burdens, labour risks, and representational harms toward the Global South, precarious data workers, and under-represented cultures. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.
comment: Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. Montreal, Canada
♻ ☆ Dynamic Expert-Guided Model Averaging for Causal Discovery
Would-be practitioners of causal discovery face a dizzying array of algorithms without a clear best choice. This abundance of competitive methods makes ensembling a natural strategy for practical applications. At the same time, real-world use cases frequently violate the assumptions on which common causal discovery algorithms are based, forcing reliance on expert knowledge. Inspired by recent work on dynamically requested expert knowledge and large language models (LLMs) as experts, we present a flexible model averaging method that integrates selective expert querying to ensemble a diverse set of causal discovery algorithms. Crucially, we distinguish between edge existence and orientation, enabling the method to leverage the complementary strengths of data-driven discovery and expert input. We further consider the realistic setting of limited access to an imperfect expert, using disagreement among algorithms to query the expert in cases of greater uncertainty. Experiments demonstrate that our method consistently outperforms strong baselines on both clean and noisy data. Code and data are available at https://anonymous.4open.science/r/expert-cd-ensemble-3282/.
♻ ☆ Latent Abstraction for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose \textbf{LAnR} (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated \texttt{[PRED]} token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.
♻ ☆ HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding ACL 2026
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs, remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
comment: Accepted to ACL 2026 Main
♻ ☆ AI Agents Alone Are Not (Yet) Sufficient for Social Simulation
Recent advances in large language models (LLMs) have spurred growing interest in using LLM-integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role-specified agents are placed in a networked multi-agent setting. This position paper argues that LLM-based agents alone are not (yet) sufficient for social simulation. We attribute this over-optimism to a systematic mismatch between what current agent pipelines are typically optimized and validated to produce and what simulation-as-science requires. Concretely, role-playing plausibility does not imply faithful human behavioral validity; collective outcomes are frequently mediated by agent-environment co-dynamics rather than agent-agent messaging alone; and results can be dominated by interaction protocols, scheduling, and initial information priors. To make these underlying mechanisms explicit and auditable, we propose a unified formulation of AI agent-based social simulation as an environment-involved Markov game with explicit exposure and scheduling mechanisms, from which we derive concrete actions for design, evaluation, and interpretation.
comment: 16 pages
♻ ☆ Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts analogously to a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
♻ ☆ DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English
More than 80% of the 1.6B English speakers do not use Standard American English (SAE), yet LLMs often fail to correctly identify non-SAE dialects and generate stereotyped responses for their speakers. We introduce DialectLLM, the first large-scale framework for generating high-quality multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features. DialectLLM produces a dialect-parallel dialog dataset spanning nine English dialects. Partnering with native linguists, we design and validate SAE-to-dialect transformation rules, ensuring authenticity. Our approach challenges the prevailing practice of applying a single morphosyntactic feature set to both user utterances and model responses, showing that models should not reproduce up to 90% of the grammatical features of a dialect. Human evaluation confirms data quality, with annotators preferring DialectLLM over prior methods in 98.8% of pairwise comparisons for dialect naturalness. We then construct DialectLLM-Bench, a dialect-parallel benchmark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for prominent dialects like Canadian English, and systematically misclassify non-SAE dialects as American or British. Beyond benchmarking, we show that DialectLLM data also serve as a scalable LLM post-training resource, suggesting a practical path toward dialect-aware conversational AI.
♻ ☆ CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
♻ ☆ MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents NeurIPS 2026
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $Ω(1/ρ^2)$ calibration samples and MEMSAD achieves this up to $\log(1/δ)$ factors. We further derive online regret bounds for rolling calibration at rate $O(σ^{2/3}Δ^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $Δ$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.
comment: 28 pages, 9 figures, 6 theorems. Submitted to NeurIPS 2026
♻ ☆ Beyond Factual Correctness: Mitigating Preference-Inconsistent Explanations in Explainable Recommendation
LLM-based explainable recommenders can produce fluent explanations that are factually correct, yet still justify items using attributes that conflict with a user's historical preferences. Such preference-inconsistent explanations yield logically valid but unconvincing reasoning and are largely missed by standard hallucination or faithfulness metrics. We formalize this failure mode and propose PURE, a preference-aware reasoning framework following a select-then-generate paradigm. Instead of only improving generation, PURE intervenes in evidence selection, it selects a compact set of multi-hop item-centric reasoning paths that are both factually grounded and aligned with user preference structure, guided by user intent, specificity, and diversity to suppress generic, weakly personalized evidence. The selected evidence is then injected into LLM generation via structure-aware prompting that preserves relational constraints. To measure preference inconsistency, we introduce a feature-level, user-centric evaluation metric that reveals misalignment overlooked by factuality-based measures. Experiments on three real-world datasets show that PURE consistently reduces preference-inconsistent explanations and factual hallucinations while maintaining competitive recommendation accuracy, explanation quality, and inference efficiency. These results highlight that trustworthy explanations require not only factual correctness but also justification aligned with user preferences.
comment: The authors have identified an issue in the evaluation protocol in Section 5.1.3. Feature extraction and semantic matching used to compute P-EHR require correction and re-validation, as they may not have been applied consistently across all generated explanations and baselines. This may affect part of the reported quantitative results and analysis, so the authors withdraw this version
♻ ☆ Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
We propose X-WAM, a Unified 4D World Model that unifies real-time robotic action execution and high-fidelity 4D world synthesis (video + 3D reconstruction) in a single framework, addressing the critical limitations of prior unified world models (e.g., UWM) that only model 2D pixel-space and fail to balance action efficiency and world modeling quality. To leverage the strong visual priors of pretrained video diffusion models, X-WAM imagines the future world by predicting multi-view RGB-D videos, and obtains spatial information efficiently through a lightweight structural adaptation: replicating the final few blocks of the pretrained Diffusion Transformer into a dedicated depth prediction branch for the reconstruction of future spatial information. Moreover, we propose Asynchronous Noise Sampling (ANS) to jointly optimize generation quality and action decoding efficiency. ANS applies a specialized asynchronous denoising schedule during inference, which rapidly decodes actions with fewer steps to enable efficient real-time execution, while dedicating the full sequence of steps to generate high-fidelity video. Rather than entirely decoupling the timesteps during training, ANS samples from their joint distribution to align with the inference distribution. Pretrained on over 5,800 hours of robotic data, X-WAM achieves 79.2% and 90.7% average success rate on RoboCasa and RoboTwin 2.0 benchmarks, while producing high-fidelity 4D reconstruction and generation surpassing existing methods in both visual and geometric metrics.
comment: Project website: https://sharinka0715.github.io/X-WAM/
♻ ☆ Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering ACL 2026
Existing long-document question answering systems typically process texts as flat sequences or use heuristic chunking, which overlook the discourse structures that naturally guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) for long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: language-universal discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Extensive experiments on four datasets demonstrate consistent improvements over existing approaches through the incorporation of discourse structure, across multiple genres and languages. Moreover, the proposed framework exhibits strong robustness across diverse document types and linguistic settings.
comment: 21 pages, 9 figures. Accepted at ACL 2026 Main conference
♻ ☆ PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning
Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their compatibility with language-based interfaces and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text question-answering dataset that bridges raw PPG waveforms and natural language through a unified question-answering (QA) formulation. PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The dataset comprises over 1 million standardized 10-second PPG segments, associated with nearly 2.5 million question-answer pairs. We further define reproducible data pipeline, training, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models. We publicly release the dataset and code at https://huggingface.co/datasets/Manhph2211/PulseLM and https://github.com/manhph2211/PULSE-LM, respectively.
comment: PulseLM v2
♻ ☆ Goal-Driven Query Answering over First- and Second-Order Dependencies with Equality
In this paper we present the first goal-driven query answering technique for first- and second-order dependencies with equality. Our technique transforms the input dependencies so that applying the chase to the output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the following three novel techniques. First, we present a variant of the singularisation technique by Marnette [59] that can handle function variables and that corrects an incompleteness of a related formulation by ten Cate et al. [73]. Second, we present a relevance analysis technique that can eliminate dependencies that provably do not contribute to query answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be orders of magnitude faster than computing the full universal model.
comment: 47 pages
♻ ☆ An Efficient Insect-inspired Approach for Visual Point-goal Navigation IEEE
In this work we develop a novel insect-inspired model for visual point-goal navigation. This combines abstracted models of two insect brain structures that have been implicated, respectively, in associative learning and path integration. We draw an analogy between the formal benchmark of the Habitat point-goal navigation task and the ability of insects to discover, learn, and refine visually guided paths around obstacles between a discovered food location and their nest. We demonstrate that the simple insect-inspired model exhibits performance comparable to recent state-of-the-art models at many orders of magnitude less computational cost. Testing in a more realistic simulated environment shows the approach is robust to perturbations.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series ICLR 2026
Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 11.6% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi-Ackman/TeCh.
comment: Accepted by ICLR 2026 (Oral). arXiv admin note: text overlap with arXiv:2405.19363 by other authors
♻ ☆ Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
Solving an NP-hard optimization problem often requires reformulating it for a specific solver -- quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+ reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem-reductions.
comment: The source code is available at https://github.com/CodingThrust/problem-reductions
♻ ☆ Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix
In the context of multi-modality knowledge distillation research, the existing methods was mainly focus on the problem of only learning teacher final output. Thus, there are still deep differences between the teacher network and the student network. It is necessary to force the student network to learn the modality relationship information of the teacher network. To effectively exploit transfering knowledge from teachers to students, a novel modality relation distillation paradigm by modeling the relationship information among different modality are adopted, that is learning the teacher modality-level Gram Matrix.
comment: 15 pages, 2 figures
♻ ☆ InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
We study how large language models can be used to evolve inventory policies in online, non-stationary environments. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance for static and highly structured problems such as mathematical discovery, but is not directly suited to online dynamic inventory settings. To this end, we propose InvEvolve, an end-to-end inventory-policy evolution and inference framework grounded in confidence-interval-based certification. The framework trains a large language model using reinforcement learning, incorporates demand data as well as numerical and textual features beyond demand, and generates white-box inventory policy with statistical safety guarantees for deployment in future periods. We further introduce a unified theoretical interface that connects training, inference, and deployment. This allows us to characterize the probability lower bound that the InvEvolve evolves a statistically safe and improved policy, and to quantify the multi-period performance gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, InvEvolve outperforms classical inventory policies and deep learning based methods. In canonical inventory settings, it evolves new policies that improve upon existing benchmarks.
♻ ☆ Continually Evolving Skill Knowledge in Vision Language Action Model
Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework without increasing network parameters.Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure.Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1 % data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery.Project Website: https://stellarvla.github.io/
♻ ☆ Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
LLMs are trained once, then deployed into a world that never stops changing. External memory compensates for this, but most systems manage it explicitly rather than letting it adapt on its own. Biological memory works differently: coupled multi-timescale dynamics make new associations immediately usable, strengthen what repetition confirms, and let the rest fade. We argue that external memory should follow a similar principle. In Memini, this view takes the form of an associative memory that organizes knowledge as a directed graph. Each edge carries two coupled internal variables, one fast and one slow, following the Benna-Fusi model of synaptic consolidation. From this coupling, episodic sensitivity, gradual consolidation, and selective forgetting emerge as facets of a single mechanism, reframing external memory as a learning substrate that reorganizes through its own dynamics.
comment: Preprint. 9 pages, 2 figures
♻ ☆ Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
Recent advances in autonomous driving (AD) have highlighted the potential of hyperspectral imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing high-dimensional spectral data remains a significant challenge. This paper presents an empirical investigation of a Multi-Scale Attention Mechanism (MSAM) for enhanced spectral feature extraction through three parallel 1D convolutions with varying kernel sizes (1-11) and adaptive feature aggregation. By integrating MSAM into UNet's skip connections, we evaluate performance improvements in semantic segmentation across multiple HSI datasets for urban driving scenarios. Comprehensive ablation studies demonstrate that MSAM consistently outperforms baseline UNet-SC, achieving average improvements of 2.32% in mIoU and 2.88% in mF1, while maintaining competitive GPU performance against established attention mechanisms. Our findings reveal that optimal kernel combinations are dataset-specific, with configurations such as (1;5;11) and (3;7;11) demonstrating particularly strong performance. This empirical investigation advances understanding of HSI processing capabilities for AD applications and establishes a foundation for adaptive multi-scale spectral feature extraction in automotive deployment.
♻ ☆ Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning
Time series is a pervasive data type across various application domains, rendering the reasonable solving of diverse time series tasks a long-standing goal. Recent advances in large language models (LLMs), especially their reasoning abilities unlocked through reinforcement learning (RL), have opened new opportunities for tackling tasks with long Chain-of-Thought (CoT) reasoning. However, leveraging LLM reasoning for time series remains in its infancy, hindered by the absence of carefully curated time series CoT data for training, limited data efficiency caused by underexplored data scheduling, and the lack of RL algorithms tailored for exploiting such time series CoT data. In this paper, we introduce VeriTime, a framework that tailors LLMs for time series reasoning through data synthesis, data scheduling, and RL training. First, we propose a data synthesis pipeline that constructs a TS-text multimodal dataset with process-verifiable annotations. Second, we design a data scheduling mechanism that arranges training samples according to a principled hierarchy of difficulty and task taxonomy. Third, we develop a two-stage reinforcement finetuning featuring fine-grained, multi-objective rewards that leverage verifiable process-level CoT data. Extensive experiments show that VeriTime substantially boosts LLM performance across diverse time series reasoning tasks. Notably, it enables compact 3B, 4B models to achieve reasoning capabilities on par with or exceeding those of larger proprietary LLMs.
♻ ☆ SoccerMaster: A Vision Foundation Model for Soccer Understanding CVPR 2026
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection and identification) to high-level semantic reasoning (e.g., event classification). Concretely, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline, SoccerFactory, to generate scalable spatial annotations, and integrate multiple existing soccer video datasets as a comprehensive pretraining data resource for multi-task pretraining; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.
comment: Accepted by CVPR 2026 (Oral); Project Page: https://haolinyang-hlyang.github.io/SoccerMaster
♻ ☆ CAMEL: Confidence-Gated Reflection for Reward Modeling ICML 2026
Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
comment: ICML 2026
♻ ☆ Keep Rehearsing and Refining: Lifelong Learning Vehicle Routing under Continually Drifting Tasks
Existing neural solvers for vehicle routing problems (VRPs) are typically trained either in a one-off manner on a fixed set of pre-defined tasks or in a lifelong manner with tasks arriving sequentially, assuming sufficient training on each task. Both settings overlook a common real-world property: problem patterns may drift continually over time, yielding massive tasks sequentially arising, each with only limited training resources. In this paper, we propose a novel lifelong learning paradigm for neural VRP solvers under continual task drift over time, where each task is locally stationary at one learning time step but receives only insufficient training resources. We empirically demonstrate that such continual drift arises in practice using a real-world logistics dataset. We then propose Dual Replay with Experience Enhancement (DREE), a general framework to improve learning efficiency and mitigate catastrophic forgetting under such drift. Extensive experiments based on both the real-world logistics dataset and commonly used synthetic dataset show that, under such continual drift, DREE effectively learns new tasks, preserves prior knowledge, improves generalization to unseen tasks, and can be applied to various existing neural solvers.
♻ ☆ Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/
comment: Accepted by The Visual Computer
♻ ☆ Adaptive Greedy Frame Selection for Long Video Understanding
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1~FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
♻ ☆ AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
comment: 18 pages, 5 figures
♻ ☆ Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems
As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).
♻ ☆ Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
♻ ☆ SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing
Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.
♻ ☆ PixelGen: Improving Pixel Diffusion with Perceptual Supervision
Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity to perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Codes are available at https://github.com/Zehong-Ma/PixelGen.
comment: Project Pages: https://zehong-ma.github.io/PixelGen/
♻ ☆ asRoBallet: Closing the Sim2Real Gap via Friction-Aware Reinforcement Learning for Underactuated Spherical Dynamics
We introduce asRoBallet, to the best of our knowledge, the first end-to-end reinforcement learning (RL) locomotion policy deployed on a humanoid ballbot hardware platform. Historically, ballbots have served as a canonical benchmark for underactuated and nonholonomic control, which are characterized by a reality gap in complex friction models for wheel-ball-floor interactions. While current literature demonstrates successful handling of 3D balancing with LQR and MPC, transitioning to actual hardware for a humanoid ballbot using RL is currently hindered by critical gaps in contact modeling, actuator latency & jitter, and safe hardware exploration. This study proposes a high-fidelity MuJoCo simulation that explicitly models the discrete roller mechanics of ETH-type omni-wheels, thereby capturing parasitic vibrations and contact discontinuities that have previously been ignored. We also developed a Friction-Aware Reinforcement Learning framework that achieves zero-shot Sim2Real transfer by mastering the coupled rolling, lateral, and torsional friction channels at the wheel-ball and ball-floor interfaces. We designed asRoBallet through subtractive reconfiguration, repurposing key components from an overconstrained quadruped and integrating them into a newly designed structural frame to achieve a robust research platform at low cost. We also developed a generalized iOS ecosystem that transforms consumer electronics into a low-latency interface, enabling a single operator to orchestrate expressive humanoid maneuvers via intuitive natural motion.
comment: 10 pages, 9 figure, accepted for RSS2026. For Supplementary Videos, see https://bionicdl.ancorasir.com/?p=2238
♻ ☆ AROpt: An Optimization Method for Autoregressive Time Series Forecasting
Current time-series forecasting models are primarily based on transformer-style neural networks. These models achieve long-term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, traditional time-series forecasting model training ignores the monotonic error-growth heuristic. In this paper, we propose a novel training method for time-series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon. Violations of this trend are interpreted as rollout inconsistency and are softly penalized during training, and (2) the method enables models to be able to concatenate short-term AR predictions to form flexible long-term forecasts. Empirical results demonstrate that our method establishes a new state-of-the-art across multiple benchmarks, achieving an MSE reduction of more than $10\%$ compared to iTransformer and other recent strong baselines. Furthermore, it enables short-horizon forecasting models to perform reliable long-term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt
comment: 16 pages, 5 figures, 3 tables
♻ ☆ Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
comment: 36 pages
♻ ☆ ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning
We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.
comment: v2: Updated abstract; strengthened the proof of Proposition 4.1; corrected minor typos; corrected author list
Computation and Language 212
☆ EMO: Pretraining Mixture of Experts for Emergent Modularity
Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity-the independent use and composition of expert subsets-without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
☆ Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter's reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
☆ When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
comment: SimpleAudit Repository: https://github.com/kelkalot/simpleaudit
☆ Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change from the Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), in which GRPO reduces the complicated advantage estimation with simple estimation over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness makes penalizing a few sampled negatives unlikely to cover a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over the positive rollout set. Thus, no disjoint negative rollouts are used for the gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollouts redistribution. Next, POPO stabilizes the policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, we replace the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text-LLM models, e.g., the Qwen family, across all-level mathematical benchmarks. Our experiment demonstrates that POPO achieves performance comparable to, or even superior to GRPO. Notably, we show that POPO can achieve 36.67% in AIME 2025 with Qwen-Math-7B, outperforming GRPO 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO components.
☆ StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
comment: 26 pages, 4 figures, 7 tables
☆ Recursive Agent Optimization
We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model's context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
☆ Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
☆ Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
☆ Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation
We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.
comment: To be published in the 20th Linguistic Annotation Workshop
☆ MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems ICML 2026
Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iteratively refine prompts across the entire system. A core innovation of MASPO is its joint evaluation mechanism, which assesses prompts not merely by their local validity, but by their capacity to facilitate downstream success for successor agents. This effectively bridges the gap between local interactions and global outcomes without relying on ground-truth labels. Furthermore, MASPO employs a data-driven evolutionary beam search to efficiently navigate the high-dimensional prompt space. Extensive empirical evaluations across 6 diverse tasks demonstrate that MASPO consistently outperforms state-of-the-art prompt optimization methods, achieving an average accuracy improvement of 2.9. We release our code at https://github.com/wangzx1219/MASPO.
comment: Accepted at ICML 2026
☆ Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance
As large language models (LLMs) increasingly mediate both content generation and moderation, linguistic evasion strategies known as Algospeak have intensified the coevolution between evaders and detectors. This research formalizes the underlying dynamics grounded in a joint action model: when Algospeak increases, detectability and understandability decrease. Further, the concept of Majority Understandable Modulation (MUM) is introduced and defined as the modulation level at which additional evasive alteration increases detector evasion but loses comprehension for the majority of recipients. To empirically probe this trade-off, we introduce a reproducible framework that can be used to create meaning-preserving, Algospeak-style variants, based on an existing taxonomy and with tunable modulation levels. Using COVID-19 disinformation as a first proof-by-example setting, we construct a reference dataset of 700 modulated items, drawn from twenty base sentences across five modulation levels and seven strategies. We then run two linked evaluations with seven different language models: one testing for interpretation through meaning recovery and one for disinformation detection through classification. Curve fitting over modulation levels yields an estimate of the Majority Understandable Modulation threshold and enables sensitivity analyses across strategies and models, see Figure 1. Results reveal the characteristic relationships between understandability and modulation. This study lays the groundwork for understanding the dynamics behind Algospeak and provides the framework, dataset, and experimental setups described.
comment: Under Review
☆ When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.
comment: Code is available at https://github.com/Dingzhen230/SignSGD_Outperforms_SGD
☆ SkillOS: Learning Skill Curation for Self-Evolving Agents
LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.
comment: 11 pages, 6 figures, 3 tables
☆ UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
comment: 22 pages, 12 figures
☆ Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings
The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians to review efficiently. This paper investigates automated clinical report generation for avatar-guided, home-based cognitive remediation sessions in a low-resource setting with no reference reports. We present and compare two approaches: (1) a rule-based template system encoding speech therapy domain knowledge as explicit decision rules and validated templates, ensuring clinical reliability and traceability; and (2) a zero-shot LLM-based approach (GPT-4) aimed at more fluent and concise output. Both systems use identical pre-extracted, expert-validated structured variables, enabling a controlled factual comparison. Outputs were evaluated by eight speech therapists and final-year students using a nine-criterion questionnaire. Results reveal a clear trade-off between clinical reliability and linguistic quality. The template-based system scored higher on fluidity, coherence, and results presentation, while GPT-4 produced more concise output. Directional differences are consistent across evaluation dimensions, though no comparison reached statistical significance after correction, reflecting the scale constraints of expert clinical evaluation. Based on evaluator feedback, we derive eight design recommendations for clinical reporting systems in remote rehabilitation settings. More broadly, this work contributes a replicable methodology combining expert elicitation, taxonomy-driven generation, and multi-dimensional human evaluation for clinical NLG in low-resource settings, and illustrates how controlled comparisons can inform the responsible adoption of generative AI in healthcare.
☆ PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
comment: 101 pages, 7 Figures, pre-print, Under Review
☆ Long Context Pre-Training with Lighthouse Attention
Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed towards the end of the training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward pass kernel. Our contribution is three-fold: (i) A subquadratic hierarchical pre- and post-processing step that does adaptive compression and decompression of the sequence. (ii) A symmetrical compression strategy that pools queries, keys and values at the same time, while preserving left-to-right causality, which greatly improves parallelism. (iii) A two stage training approach which we pre-train for the majority of the time with Lighthouse Attention and recover a full attention model at the end with a short training. We run preliminary small scale LLM pre-training experiments that show the effectiveness of our method compared to full attention training with all other settings matched, where we achieve a faster total training time and lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
comment: 18 pages, 4 figures, 4 tables
☆ Continuous Latent Diffusion Language Model
Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable text-to-latent mapping with a Text VAE, then models a global semantic prior in continuous latent space with a block-causal DiT, and finally generates text through conditional decoding. From a unified Markov-path perspective, its diffusion process performs latent prior transport rather than token-level observation recovery, thereby separating global semantic organization from local textual realization. This design yields a more flexible non-autoregressive inductive bias, supports semantic compression and prior fitting in continuous space, and naturally extends to other continuous modalities. Through experiments spanning 4 research questions, 8 benchmarks, strictly matched ~2B-parameter autoregressive and LLaDA baselines, and scaling curves up to about 2000 EFLOPs, we identify an effective overall configuration of Cola DLM and verify its strong scaling behavior for text generation. Taken together, the results establish hierarchical continuous latent prior modeling as a principled alternative to strictly token-level language modeling, where generation quality and scaling behavior may better reflect model capability than likelihood, while also suggesting a concrete path toward unified modeling across discrete text and continuous modalities.
comment: 99 pages, 31 figures, 9 tables. Project page: https://hongcanguo.github.io/Cola-DLM/
☆ Efficient Pre-Training with Token Superposition
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
comment: 25 pages, 11 figures, 28 tables
☆ STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user's query, and they struggle to recognize when a change in one aspect of the user's state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.
☆ The Frequency Confound in Language-Model Surprisal and Metaphor Novelty
Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal--novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal--frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
comment: to be presented and published at the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
☆ Cubit: Token Mixer with Kernel Ridge Regression
Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
comment: Tech Report
☆ Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging-Face, achieving 9.2x faster time-to-first-token, 52x higher throughput, and 14x memory reduction compared to standard PyTorch inference on Apple Silicon, with similar speedups on Intel and AMD processors.
☆ Patch-Effect Graph Kernels for LLM Interpretability
Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods: direct-influence via causal mediation, partial-correlation, and co-influence and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI and PC selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.
☆ Towards Emotion Consistency Analysis of Large Language Models in Emotional Conversational Contexts
In this work, we conduct an analysis to examine the consistency of Large Language Models (LLMs) with respect to their own generated responses in an emotionally-driven conversational context. Specifically, the text generated by LLM is framed as a query to the same model, and its responses are subsequently assessed. This is performed with three queries across two dimensions of extreme and moderate emotions. The three queries are, in particular, false claim queries that contain inherently wrong assumptions (false presuppositions) in increasing order of intensity. Two commercial models, Claude-3.5-haiku, GPT4o-mini, and a medium-sized model, Mistral-7B, are considered in the study. Our findings indicate that LLMs exhibit below-average performance and remain vulnerable to false beliefs embedded within queries. This susceptibility is especially pronounced for moderate emotional content. Furthermore, an extended attention-score-based analysis highlights a shift in models' priority from evaluative to generative. The results raise important considerations for LLMs' deployment in high-stakes, emotionally sensitive contexts.
comment: Under-review
☆ Invariant Features in Language Models: Geometric Characterization and Model Attribution
Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: (1) a geometric characterization of invariant latent features, (2) a contrastive subspace discovery method that separates semantic-changing from semantic-preserving variation, and (3) an application of invariant representations to zero-shot model attribution. Across models and layers, empirical results support these contributions. Invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, and representation-level interventions indicate a causal role of invariant components in model outputs. Invariant representations also capture model-specific geometric patterns, enabling accurate attribution. These findings suggest that semantic invariance can be viewed as a local geometric property of latent representations, offering a principled perspective on how language models organize meaning.
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
☆ From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection LREC
We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism-pipeline.
comment: 14 pages, 5 tables. Accepted at NeoLLM 2026 Workshop, co-located with LREC-COLING 2026
☆ MiA-Signature: Approximating Global Activation for Long-Context Understanding
A growing body of work in cognitive science suggests that reportable conscious access is associated with \emph{global ignition} over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of \textbf{Mindscape Activation Signature (MiA-Signature)}, a compressed representation of the global activation pattern induced by a query. In LLM systems, this is instantiated via submodular-based selection of high-level concepts that cover the activated context space, optionally refined through lightweight iterative updates using working memory. The resulting MiA-Signature serves as a conditioning signal that approximates the effect of the full activation state while remaining computationally tractable. Integrating MiA-Signatures into both RAG and agentic systems yields consistent performance gains across multiple long-context understanding tasks.
comment: This is a work in progress; we will continue to revise and improve the manuscript
☆ E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
comment: 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework
☆ WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.
☆ GATHER: Convergence-Centric Hyper-Entity Retrieval for Zero-Shot Cell-Type Annotation SIGIR 2026
Zero-shot single-cell cell-type annotation aims to determine a cell's type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM reasoning. However, in this setting each query contains tens to hundreds of genes, where no single gene is decisive and the label emerges only from their collective co-occurrence. Such hyper-entity queries fundamentally challenge local, entity-wise exploration strategies, which reason from individual genes, leading to poor scalability and substantial LLM cost. We propose GATHER (Graph-Aware Traversal with Hyper-Entity Retrieval), a convergence-centric retriever tailored to hyper-entity queries. It performs global multi-source graph traversal and identifies topological convergence points -- nodes jointly reachable from many input genes. These convergence nodes act as high-information hyper-entities that capture entity synergy. By incorporating node- and path-importance scoring, GATHER selects informative evidence entirely without LLM involvement during retrieval. Instantiated on a self-constructed cell-centric biological knowledge graph (VCKG), GATHER outperforms strong KG-RAG baselines (ToG, ToG-2, RoG, PoG) on two datasets (Immune and Lung), achieving the highest exact-match accuracy (27.45% and 59.64%) with only a single LLM call per sample, compared to 2--61 calls for KG-RAG baselines. Our results demonstrate that convergence nodes compress multi-entity signals into compact, high-information evidence that conveys more per item than multi-hop paths, providing an efficient global alternative to local entity-wise reasoning.
comment: Accepted to SIGIR 2026. 2 figures, 3 tables
☆ SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following
In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way for better measuring instruction-following capabilities in assistants.
☆ Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
☆ Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
☆ MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. %This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
☆ Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$pp ($p=0.007$) and reduces harmful compliance vs. deployment by $3.6$pp ($p=0.024$, $0/20$ items inverted) -- while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious}, with marginal eval-vs-deployment refusal effects of $-9$ to $-20$pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the $70$B model preserves direction with attenuated magnitude, ruling out a simple ``small-model effect that reverses at scale.'' One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.
☆ Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
☆ Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evolution (LATTE), a framework for coordinating LLM teams inspired by distributed systems, where processors must operate under partial observability and communication constraints. In LATTE, a team of agents collaboratively construct and maintain a shared, evolving coordination graph which encodes sub-task dependencies, individual agent assignment, and the current state of sub-task progress. This protocol maintains consistency while empowering agents to dynamically allocate work, adapt coordination, and discover new tasks. Across multiple collaborative tasks and a variety of base models, we demonstrate how LATTE reduces token usage, wall-clock time, communication, and coordination failures (e.g. file conflicts and redundant outputs) while matching or exceeding the accuracy of standard designs including MetaGPT, decentralized teams, top-down Leader-Worker hierarchies, and static decompositions.
☆ Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation
Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about \textit{who} annotated (sociodemographics, attitudes, etc.), the \textit{what} (e.g., linguistic properties of items), and their interplay has received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work, and that a strong role is played by lexical cues and annotator attitudes. Effect patterns, however, vary considerably across datasets. This urges caution about generalization and transferability.
☆ MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method
Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is even more difficult. Currently, Machine Learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segmentation task as an anomaly detection of energy-based segmented audio sequences. Our method applies an Isolation Forest on audio representations learned from BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.
☆ Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
comment: 10 pages, 3 figures, 2 tables, 11 appendices
☆ LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.
☆ Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement
Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or \emph{holistic} judgment - for example, rating the ``quality'' of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for \emph{analytic} judgments, which decompose assessment criteria - for example, ``quality'' into ``fluency'' and ``organization''. While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits providing representative examples and additional context, and reducing positional bias in the rubric increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.
☆ Linear Semantic Segmentation for Low-Resource Spoken Dialects ACL
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code-switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
comment: ACL Findings 2026
☆ Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1--3\% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
☆ YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling SemEval-2026
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmentation, and class weighting to improve classification performance under severe label imbalance. Our findings indicate that independent task modeling combined with class weighting is more effective.
comment: Accepted to the SemEval-2026 workshop of the ACL 2026 conference
☆ UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechanisms, which achieve their maximum speedup only on full-attention models. When transferred to emerging architectures--such as linear/full attention hybrids or sliding window/full attention hybrids--these prefill acceleration approaches suffer significant performance degradation. Furthermore, such methods are generally incompatible with continuous batching, making them difficult to integrate into modern inference engines such as vLLM. To this end, we propose UniPrefill, a prefill acceleration framework applicable to virtually any model architecture, which directly accelerates the model's computation at the token level. We further implement UniPrefill as a continuous batching operator and extend vLLM's scheduling strategy to natively support prefill-decode co-processing and tensor parallel for UniPrefill, enabling its seamless integration into vLLM. UniPrefill achieves up to 2.1x speedup in Time-To-First-Token (TTFT), with the acceleration becoming increasingly pronounced as the number of concurrent requests grows.
comment: code: https://github.com/qhfan/UniPrefill.git
☆ TIDE: Every Layer Knows the Token Beneath the Context
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited parameters models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as improve performance across multiple language modeling and downstream tasks.
☆ Contrastive Identification and Generation in the Limit
In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target's support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton, which encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs $\{x,y\}$ satisfying $h(x)\ne h(y)$ for an unknown target binary hypothesis $h$, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of Angluin [1980]'s tell-tale condition), a combinatorial dimension called contrastive closure dimension (a contrasitive analogue of the closure dimension in Raman et al. [2025]) and exactly characterizing uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.
☆ A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
☆ The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model's default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
comment: 28 pages, including appendices
☆ OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models
On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.
☆ Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
☆ Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
comment: 28 pages, 13 figures
☆ IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences
When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wiki-data-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.
comment: 29 pages, 8 figures
☆ MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10--20\% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
☆ On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization to maximizing the probability that the entire workflow completes successfully under explicit budget and deadline constraints. We study \emph{constraint-driven online resource allocation for agentic workflows}. Given a dependency-structured workflow and estimates of success rates and generation lengths for each subtask--model pair, the executor allocates models and parallel samples across simultaneously executable subtasks while managing the remaining budget and time. We formulate this setting as a finite-horizon stochastic online allocation problem and propose \emph{Monte Carlo Portfolio Planning} (MCPP), a lightweight closed-loop planner that directly estimates constrained completion probability through simulated workflow executions and replans after observed outcomes. Experiments on CodeFlow and ProofFlow demonstrate that MCPP consistently improves constrained completion probability over strong baselines across a wide range of budget--deadline constraints.
comment: Preprint
☆ Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.
☆ Milestone-Guided Policy Learning for Long-Horizon Language Agents
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.
☆ Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics-Circuit Distance, Circuit Stability, and Circuit Conflict-to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research.
comment: 26 pages
☆ Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible "paths" of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.
☆ More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs
This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis compares two generations of LLMs in the context of two human-authored English news datasets from two different years. Employing the Head-Driven Phrase Structure Grammar (HPSG) formalism, we investigate the distributions of syntactic structures and lexical types of AI-generated texts and contrast them with the corresponding distributions in the human-authored New York Times (NYT) articles. We use diversity metrics from ecology and information theory to quantify variation in grammatical constructions and lexical types. We show that English news text has changed little in the given time frame, while newer LLMs display reduced syntactic and, especially, lexical diversity compared to older, non-instruction-tuned models. These findings point to future work in studying effects of instruction tuning, which, while enhancing coherence and adherence to prompts, may narrow the expressive range of model output.
☆ PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue
As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas -- such as authoritative instructors, uncooperative merchants, or distracted workers -- they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating ``always-yield'' policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.
☆ From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence LREC 2026
Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework leverages large language models (LLMs) to rewrite these anchor sentences into stand-alone, context-independent premises and investigates the extraction of additional implicit evidence. In evaluations on cross-article evidence retrieval and claim verification, the extracted premises substantially improve performance. Decontextualized evidence yields higher retrievability, achieving up to a 30 percent relative gain in Mean Reciprocal Rank over verbatim sentences, and using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across different verdict granularities (2-class vs. 5-class) and model architectures. A qualitative analysis indicates that the decontextualized premises remain faithful to the original sources. Our work highlights the promise of reusing fact-checkers' evidence for automation and provides a large-scale resource of structured evidence from real-world fact-checks.
comment: Accepted at LREC 2026. To appear in the conference proceedings
☆ Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks ICML 2026
The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to $<$10 while preserving competitive performance on benign downstream tasks.
comment: Accepted to ICML 2026
☆ TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning ACL 2026
Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the highly agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.
comment: Accepted to ACL 2026
☆ Tatarstan Toponyms: A Bilingual Dataset and Hybrid RAG System for Geospatial Question Answering
This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrates dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance. On 500 test queries, the hybrid search achieves Recall@1=0.988, Recall@5=1.000, and MRR=0.994, significantly outperforming BM25 and purely spatial methods. Among tested reader architectures (RuBERT, XLM-RoBERTa-large, T5-RUS), XLM-RoBERTa-large attains the best quality: EM=0.992, F1=0.994. On raw outputs, RuBERT models fail on coordinate questions (F1=0) while XLM-RoBERTa-large reaches F1=0.984; however, simple post-processing eliminates numerical gaps and restores RuBERT accuracy to 100%. This discrepancy stems from tokenization differences and pre-training corpora composition. All resources (dataset, QA corpus, model weights, web demo) are openly published on Hugging Face. Results apply to geospatial QA services, geocoding, and digital humanities in multilingual regions.
comment: Preprint
☆ TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity ACL 2026
We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.
comment: ACL 2026 Findings
☆ Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
☆ Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or model-specific heuristics, making them vulnerable to paraphrasing and adversarial manipulations, and consequently limiting their robustness and interpretability. In this work, we proposeLiSCP , a novel lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Our approach constructs a consistency profile that combines discrete stylistic features with continuous semantic signals, leveraging stylistic stability across multimodal-guided paraphrased text variants. Experiments spanning real-world multimedia news and movie datasets and conventional text domains demonstrate that LiSCP achieves superior performance on in-domain detection and outperforms existing approaches by up to 11.79% in cross-domain settings. Additionally,it demonstrates notable robustness under adversarial scenarios, including adversarial attacks and hybrid human-AI settings.
☆ MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high fidelity, long term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are three fold: (1) we release a novel dataset comprising 200 hours of diverse, long form egocentric data with persistent state tracking; (2) we open source a mobile application that enables any user to record egocentric data, and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive scale acquisition of long horizon data across varied global environments, accelerating the development of generalizable robotic policies.
☆ Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the $Δ$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $Δ$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves a 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Codes will be released soon.
☆ Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines WhisperPro, a unified speech encoder that produces synchronized text tokens and prosody embeddings, with an LLM backbone trained to preserve the semantic capabilities of the original TLM while learning paralinguistic understanding. Experiments show that TextPro-SLM achieves the lowest modality gap among leading SLMs at both 3B and 7B scales, while also delivering strong overall performance on paralinguistic understanding tasks. These gains are achieved with only roughly 1,000 hours of LLM training audio, suggesting that reducing the modality gap from the input side is both effective and data-efficient.
comment: Work in progress
☆ Logic-Regularized Verifier Elicits Reasoning from LLMs
Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource-intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats theverifier as a binary latent variable, utilizinginternal activations and enforcing three logical constraints on multiple reasoning paths:negation consistency, intra-group consistency,and inter-group consistency (grouped by thefinal answer). By incorporating logical rulesas priors, LOVER can leverage unlabeled examples and is directly compatible with any offthe-shelf LLMs. Experiments on 10 datasetsdemonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier(reaching its 95% level on average). The sourcecode is publicly available at https://github.com/wangxinyufighting/llm-lover.
☆ Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
☆ Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling SIGIR 2026
Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions.In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters. (2) Furthermore, the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform, demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days
comment: Accepted by SIGIR 2026
☆ Evaluation Awareness in Language Models Has Limited Effect on Behaviour
Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, moral reasoning, and political opinion. We tested this both on-policy, sampling multiple CoTs per item and comparing those that spontaneously contained VEA against those that did not, and off-policy, using model prefilling to inject evaluation-aware sentences where missing and remove them where present, with subsequent resampling. VEA has limited effect on model behaviour: injecting VEA into CoTs produces near-zero effects ($ω\leq 0.06$), removing it causes small shifts ($ω\leq 0.12$) and spontaneously occurring VEA shifts answer distributions by at most 3.7 percentage points ($ω\leq 0.31$). Our findings call for caution when interpreting high VEA rates as evidence of strategic behaviour or alignment tampering. Evaluation awareness may pose a smaller safety risk than the current literature assumes.
comment: 29 pages, 14 figures
☆ LeakDojo: Decoding the Leakage Threats of RAG Systems ACL 2026
Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four datasets, and diverse RAG systems. Our study reveals that (1) query generation and adversarial instructions contribute independently to leakage, with overall leakage well approximated by their product; (2) stronger instruction-following capability correlates with higher leakage risk; and (3) improvements in RAG faithfulness can introduce increased leakage risk. These findings provide actionable insights for understanding and mitigating RAG leakage in practice. Our codebase is available at https://github.com/yeasen-z/LeakDojo.
comment: Findings of ACL 2026
☆ Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation ACL 2026
Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To address this issue, we propose Distribution-Aligned Adversarial Distillation (DisAAD), which introduces a generation-discrimination architecture to guide a lightweight proxy model to learn the high-quality regions of the output distribution of the black-box LLM, thus effectively endowing it with the ability to know whether the black-box LLM knows or not. Subsequently, we use the proxy model to reproduce the specific responses of the black-box LLM and estimate the corresponding uncertainty based on evidence learning. Extensive experiments have verified the effectiveness and promise of our proposed method, indicating that a proxy model even one that only accounts for 1\% of the target LLM's size can achieve reliable uncertainty quantification.
comment: Accepted to ACL 2026
☆ Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning
Differentially private federated fine-tuning of large models with LoRA suffers from aggregation error caused by LoRA's multiplicative structure, which is further amplified by DP noise and degrades both stability and accuracy. Existing remedies apply a single update mode uniformly across all layers and all communication rounds (or alternate them on a fixed schedule), ignoring both the structural asymmetry between the two LoRA factors and the round-wise dynamics of training. We propose AS-LoRA, an adaptive framework defined by three axes (i) layer-wise freedom, in which each layer independently selects its active component, (ii) round-wise adaptivity, in which the selection updates over communication rounds, and (iii) a curvature-aware score derived from a second-order approximation of the loss. Theoretically, AS-LoRA eliminates the reconstruction-error floor of layer-tied schedules, accelerates convergence, implicitly biases solutions toward flatter minima, and incurs no additional privacy cost. Across GLUE, SQuAD, CIFAR-100, and Tiny-ImageNet under strict DP budgets and non-IID partitions, AS-LoRA improves over the federated LoRA baselines by up to $+7.5$ pp on GLUE and $+12.5$ pp on MNLI-mm for example, while matching or exceeding SVD-based aggregation methods at $33\text{--}180 \times$ lower aggregation cost and with negligible communication overhead. Code for the proposed method is available at https://anonymous.4open.science/r/as_lora-F75F/.
comment: Submitted to a conference
☆ Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild
User interactions with LLMs are shaped by prior experiences and individual exploration, but in-lab studies do not provide system designers with visibility into these in-the-wild factors. This work explores a new approach to studying real-world user-LLM interactions through large-scale chat logs from the wild. Through analysis of 140K chatbot sessions from 7,955 anonymized global users over time, we demonstrate key patterns in user expressions despite varied tasks: (1) LLM users are not tabula rasa, nor are they constantly adapting; rather, interaction patterns form and stabilize rapidly through individual early trajectories; (2) Longitudinal outcomes, such as recurring text patterns and retention rates, are strongly correlated with early exploration; (3) Parallel dynamics are present, including organizing expressions by task types such as emotional support, or in response to model-version updates. These results present an ``agency paradox'': despite LLM input spaces being unconstrained and user-driven, we in fact see less user exploration. We call for design consideration surrounding the molding procedure and its incorporation in future research.
☆ BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models ACL 2026
Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool
comment: Published at ACL 2026; Code and data available at https://github.com/gxx27/BioTool
☆ RVPO: Risk-Sensitive Alignment via Variance Regularization
Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
comment: 17 pages, 5 figures
☆ Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback
Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff's $α= 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $ρ= 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen's $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.
comment: 9 pages, 2 figures, 8 tables. Short Communication submitted to Knowledge-Based Systems (Elsevier)
☆ ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a \emph{harness} system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76\% of cases. Our ReFlect harness achieves task success rates ranging from 41\% on gpt-4o-mini to 56\% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5, and additionally raises SWE-bench patch-structural quality from 0\% (Direct CoT) to between 82\% (Qwen2.5-72B) and 87\% (GPT-4o). Notably, the harness gain is inversely proportional to the model's Direct CoT task success rate (the fitted slope is -1.69 with r=-0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We spot that adding structured reasoning state and operators yields only 15.0--18.7\% pair-mean on Llama-3.3-70B and Qwen2.5-72B because models at this scale cannot reliably populate the state its operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.
☆ More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama-3.1-8B/70B (96 conditions, up to 10 seeds). The All-In system is consistently suboptimal: on HotpotQA, a single-tool agent surpasses All-In by 32% (F1 0.233 vs 0.177, p=0.023); on GSM8K, a 3-component subset beats All-In by 79% (0.43 vs 0.24, p=0.010). Optimal component count is task-dependent (k*=1-4) and scale-sensitive: at 70B, combinations that hurt at 8B provide gains, though All-In still trails the best subset. We fit a main-effects regression (R^2=0.916, adj-R^2=0.899, LOOCV=0.872), compute exact Shapley values, and find 183/325 submodularity violations (56.3%), showing greedy selection is unreliable. A three-body synergy among Tool Use, Self-Reflection, and Retrieval (INT_3=+0.175, 95% CI [+0.003,+0.351]) is reported as exploratory. CCI replicates across model families (Qwen2.5) and is robust to prompt paraphrasing. Our findings suggest maximally-equipped agent defaults should be replaced by task-specific subset selection via interaction-aware analysis.
comment: 10 pages, 5 tables; preprint, under review
☆ Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield Delta ~= 0, with identical null results cross-architecture (Qwen2.5-7B) and cross-domain (MMLU-STEM). Three convergent lines of evidence suggest representational entanglement: the OT direction has 85-88% overlap with task-critical computation (specificity ratio <= 0.152); non-targeted shared-direction steering damages accuracy (-12.1pp); and LEACE concept erasure damages accuracy (-3.6pp, p=0.01), while 10 random erasures produce Delta=+0.3pp. The per-instance probe-steering correlation is r=-0.002 (p=0.97). Positively, the same probe enables selective abstention (held-out AUROC=0.610, exceeding all five uncertainty baselines, p=0.009): decodable failure structure supports post-generation reliability estimation even when the fixed linear steering family cannot exploit it for correction.
comment: 22 pages (14 main + 8 appendix), 5 figures, 7 tables. Under review
☆ Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning ICML 2026
Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.
comment: Accepted by ICML 2026. 25 pages, 13 figures. Code: https://github.com/wangbing1416/BADIT
☆ XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench. a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
☆ Negative Before Positive: Asymmetric Valence Processing in Large Language Models
Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.
☆ Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning
Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve-then-generate pipelines. Architectures designed to handle conflicting retrieved information - multi-agent debate, agentic retrieval, recursive language models - remain untested against adversarially optimized contradictions. We evaluate four RAG architectures (vanilla RAG, agentic RAG, MADAM-RAG, and Recursive Language Models) under controlled single-document (N=1) poisoning on 921 Natural Questions QA pairs, comparing a clean baseline, naive injection, and CorruptRAG-AK - an adversarial attack whose meta-epistemic framing targets credibility assessment. Architecture is a high-impact variable in adversarial robustness: under CorruptRAG-AK, attack success rates range from 81.9% (vanilla) to 24.4% (RLM) - a spread of nearly 58 percentage points across architectures with comparable clean accuracy (~92%). Decomposing this gap, once the poisoned document is retrieved, adversarial framing - not retrieval optimization - drives the majority of CorruptRAG-AK's advantage for three of four architectures, localizing the cross-architecture vulnerability at the content-reasoning stage. Our MADAM-RAG reimplementation shows the highest apparent contradiction detection rate, though our LLM judge over-identifies this behavior (~48.5% precision), so reported rates are upper bounds. Regardless of detection, MADAM-RAG cannot resolve contradictions reliably, producing a 41.4% non-answer rate even on clean inputs - though implementation divergences from the original may contribute. We introduce a seven-category behavioral taxonomy capturing contradiction detection, hedging, and failure modes beyond binary accuracy. Code, data, and analysis notebooks are publicly available.
☆ One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
comment: Project Website: https://turn-gate.github.io/
☆ Spherical Flows for Sampling Categorical Data
We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
☆ When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models
Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
comment: Currently under review. Dataset can be found: https://huggingface.co/datasets/duke-trust-lab/When2Speak
☆ The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
☆ Belief Memory: Agent Memory Under Partial Observability
LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API~X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.
☆ Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the ``zero-advantage problem'': when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model's output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
☆ A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction
This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings shows that high performing, human comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
☆ Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation
Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, and How). Validated across five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially. Healthcare queries are strongly concentrated in temporal (WHEN: 80.4%) and person-centric (WHO: 73.0%) dimensions far exceeding general-domain benchmarks, and causal (WHY) and mechanistic (HOW) reasoning are near-zero everywhere, with apparent HOW exceptions reflecting quantitative aggregation rather than genuine procedural reasoning. These results identify a frontier that must be crossed for genuine machine reasoning over structured data.
comment: 13 pages
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
☆ Cognitive Agent Compilation for Explicit Problem Solver Modeling
Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem-solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem-solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade-offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded-knowledge AI for educational applications.
comment: Accepted to AIED 2026 Blue Sky
☆ Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion
Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ($\GenPPL$) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\GenPPL=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
☆ SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural smells detected by PyExamine in the Python project scikit-learn, validated against expert judgments. Expert validation reveals that 63.1% of detected smells are false positives, while the best agent achieves a 47.7% resolution rate. Agents identify false positives with up to $κ= 0.94$ expert agreement, but repair aggressiveness and net codebase quality are inversely related: the most aggressive agent introduces 140 new smells. These findings expose a gap between current LLM capabilities in localized code transformations and the architectural understanding needed for cross-module refactoring. SmellBench provides reusable infrastructure for tracking progress on this underexplored dimension of automated software engineering. We release our code and data at https://doi.org/10.5281/zenodo.19247588.
comment: Preprint. 11 pages, 3 figures. Submitted to the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)
☆ Bridging Textual Profiles and Latent User Embeddings for Personalization
Personalized systems rely on user representations to connect behavioral history with downstream recommendation applications. Existing methods typically employ either supervised latent user embeddings, which are effective for retrieval but difficult to interpret, or textual user profiles, which are interpretable but challenging to optimize for downstream utility due to lack of direct supervision. To bridge this gap, we present BLUE, a reinforcement learning framework that unifies these two forms of user representation by aligning language-based user profiles with embedding-based recommendation objectives. Given a user interaction history, BLUE leverages a profiler Large Language Model (LLM) to generate textual profiles, while an embedding model provides reward signals. This encourages the resulting textual representations to move closer to positive items and farther from negative ones in the embedding space. We further introduce a text-space supervision signal based on next-item prediction, ensuring the learned profiles remain both semantically meaningful and highly effective for downstream retrieval. Experiments on Amazon Reviews 2023 and Google Local Reviews in zero-shot sequential recommendation settings demonstrate that BLUE consistently outperforms strong baselines under both frozen and trainable embedding conditions. Notably, BLUE achieves clear gains in cross-domain transfer, highlighting the strong generalization ability of the learned user profiles. Furthermore, these generated profiles provide superior personalized context for question answering compared to raw user histories or alternative profile optimization methods. Overall, these results show that BLUE provides an effective way to unify interpretable textual profiling with discriminative latent embeddings for personalization.
☆ Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries
Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.
comment: 30 pages, 4 figures, 24 tables
☆ From Surface Learning to Deep Understanding: A Grounded AI Tutoring System for Moodle IJCAI 2026
This demo paper describes the development of the AI Teaching \& Learning Assistant, a modular Moodle plugin that leverages Retrieval-Augmented Generation (RAG) to deliver high-quality, hallucination-free education. The system employs a dual-centric design, providing students with interactive, Socratic-based tutoring and educators with a "human-in-the-loop" workspace for supervised content generation. By grounding Large Language Model (LLM) responses in teacher-provided materials, the assistant addresses the risks of misinformation while encouraging deep conceptual mastery. Evaluation via the Ragas (LLM-as-a-Judge) framework and a preliminary user study confirms its effectiveness, achieving faithfulness scores up to 0.97 and a 4.00/5.00 recommendation rate.
comment: 5 pages, accepted as demo paper at IJCAI 2026
☆ MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that LLMs failed to detect 79% and 75% of instances with hateful and sarcastic content compared to a human-calibrated reference. Furthermore, we prove that it represents a "label agreement illusion", statistically validated via almost null Fleiss' Kappa ($κ\approx -0.001$) on sarcasm detection. Across 40+ LLMs, we benchmark this annotation bias propagation within the training pipeline, regardless of architectural differences. We release MultiSoc-4D as a diagnostic benchmark for annotation biases in Bengali NLP.
comment: 21 pages, 14 figures, 13 tables
☆ Can LLMs Take Retrieved Information with a Grain of Salt?
Large language models have demonstrated impressive retrieval-augmented capabilities. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information. It is a limitation with real consequences in high-stakes domains like medicine and finance. We evaluate eight LLMs on their context-certainty obedience, measuring how well they adjust responses to match expressed context certainty. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability. Our contributions include a principled evaluation metric, empirical insights into LLMs' uncertainty handling, and a portable strategy to improve context-certainty obedience across diverse LLMs.
☆ Regulating Branch Parallelism in LLM Serving
Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place. We call the excess step latency caused by admitted branches the branch externality and show that the safe width depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace. We introduce TAPER, a per-step admission controller that treats extra branches as opportunistic work, admitted only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by $1.77\times$ over IRP-Off and by $1.48\times$ over IRP-Eager, while maintaining over $95\%$ SLO attainment.
☆ MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
Large language models are now embedded in everyday writing workflows, making reliable AI-generated text detection important for academic integrity, content moderation, and provenance tracking. In practice, however, a detector must do more than achieve high aggregate AUROC on clean, in-distribution human and AI text: it should remain robust to attacks and adversarial rewrites, transfer to unseen generators and domains, and operate at low false-positive rates (FPR). Most existing detectors optimize a single AI/Human objective, giving the representation little incentive to learn generator, attack, or domain structure once the binary task saturates. We introduce MELD (Multi-Task Equilibrated Learning Detector), a deployable detector for AI-generated text that enriches binary detection with auxiliary supervision. MELD attaches generator-family, attack-type, and source-domain heads to a shared encoder, and balances the four losses with learned homoscedastic uncertainty weights. To improve robustness, an EMA teacher predicts on clean inputs while an attack-augmented student is distilled toward the teacher. MELD further uses a hard-negative pairwise ranking loss to enlarge the score margin between AI-generated texts and the most confusable human texts. At inference, all auxiliary heads are discarded, giving MELD the same interface and cost as a standard detector. On the public RAID leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially under attack and at low FPR. Across standard held-out benchmarks, MELD matches or outperforms supervised baselines. We further introduce MELD-eval, a held-out evaluation pool built from recent chat models released by four major LLM providers. Without additional finetuning, MELD achieves 99.9% TPR at 1% FPR on MELD-eval, while many baselines degrade sharply.
comment: 17 pages, 6 figures
☆ Reflections and New Directions for Human-Centered Large Language Models
Large Language Models (LLMs) are increasingly shaping the private and professional lives of users, with numerous applications in business, education, finance, healthcare, law, and science. With this rise in global influence comes greater urgency to build, evaluate, and deploy these systems in a manner that prioritizes not only technical capabilities but also human priorities. This work presents a framework for developing Human-Centered Large Language Models (HCLLMs), which integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI. Considering the ethics, economics, and technical objectives of language modeling, we argue that model developers need to address human concerns, preferences, values, and goals, not only during a cursory post-training stage, but rather with rigor and care at every stage of the pipeline. This paper offers human-centered insights and recommendations for developers at each stage, from system design to data sourcing, model training, evaluation, and responsible deployment. Then we conclude with a case study, applying these insights to understand the future of work with HCLLMs.
☆ MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
comment: Project Page: https://billyzhang24kobe.github.io/mist-smarthome/
☆ TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP
This work introduces TajPersLexon, a curated Tajik--Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
comment: Published in The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family (SilkRoadNLP 2026), pages 29-37, Rabat, Morocco. Association for Computational Linguistics
☆ Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment. Motivated by these observations, we argue that generative AI evaluation requires a paradigm shift from static benchmark-centered transparency toward stakeholder, goal, and context-conditioned utility transparency grounded in human outcome trajectories. Existing evaluations primarily characterize properties of model outputs, while deployment success depends on whether interaction with AI improves stakeholders' ability to achieve their goals over time. The missing construct is therefore utility: the change in a stakeholder's capability induced through sustained interaction with an AI system within a deployment context. To operationalize this perspective, we propose SCU-GenEval, a four-stage evaluation framework consisting of stakeholder-goal mapping, construct-indicator specification, mechanism modeling, and longitudinal utility measurement. To make these stages practically deployable, we introduce three supporting instruments: structured deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics. We conclude with domain-specific calls to action, arguing that progress in generative AI must be evaluated through measurable improvements in human outcomes rather than benchmark performance alone.
comment: 20 pages
☆ IntentGrasp: A Comprehensive Benchmark for Intent Understanding
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.
comment: IntentGrasp data is available on [Hugging Face](https://huggingface.co/datasets/yuweiyin/IntentGrasp), and the code is released on [GitHub](https://github.com/YuweiYin/IntentGrasp)
☆ ProtSent: Protein Sentence Transformers
Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting PLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, and StringDB protein--protein interactions, and Deep Mutational Scanning data. We evaluate on 23~downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, and training recipe and code.
comment: 9 figures, appendix, 2 figures, open code and models
☆ VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
comment: https://tme-lyra-lab.github.io/VITA-QinYu/
☆ When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as \textbf{unintended long-term state poisoning}. To systematically study it, we introduce the \textbf{Unintended Long-Term State Poisoning Bench (ULSPB)}, a bilingual benchmark comprising $350$ settings spanning five assistance categories, seven interaction patterns, 24-turn routine interactions, and matched single-injection counterparts. Furthermore, we define the \emph{Harm Score} (HS), a state-centric metric that quantifies \emph{authorization drift}, \emph{tool-use escalation}, and \emph{unchecked autonomy}. Experiments on OpenClaw with four backbone LLMs demonstrate that, while single-injection is generally effective, routine conversations alone can substantially poison long-term state, primarily corrupting memory-centric artifacts. Evaluations seeded with real-world user interactions confirm that this risk is not a mere artifact of synthetic prompts. To mitigate this threat, we propose \textbf{StateGuard}, a lightweight, post-execution defense that audits state diffs at the writeback boundary and selectively rolls back dangerous edits. Across all evaluated models, StateGuard reduces HS to near zero and lowers false-negative rates, with acceptable high false-positive rates under a safety-first writeback defense and minimal overhead.
comment: 23 pages
♻ ☆ Flexible Agent Alignment with Goal Inference from Open-Ended Dialog
We introduce Open-Universe Assistance Games (OU-AGs), a formal framework extending assistance games to LLM-based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, and evolving. Current LLM agents struggle in multi-turn interactions and with maintaining accurate models of user intent in collaborative settings. Existing assistance game formulations assume fixed, predefined preferences, an assumption that breaks down in open-ended dialogue where goals are revised incrementally and expressed in natural language. Grounded in cognitive science accounts of preference construction, we represent human preferences as a dynamically updated distribution over discrete natural-language goals. To operationalize OU-AGs, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient online method that extracts and ranks candidate goals during interaction, using LLM-simulated users to perform probabilistic inference over goal hypotheses. This allows for interpretable, uncertainty-aware preference representations without large offline datasets. We evaluate GOOD across three text-based domains: grocery shopping, household robotics (AI2-THOR), and coding. Compared to baselines without explicit goal tracking, GOOD produces semantically coherent goal representations and improves alignment with user intent across domains.
comment: Previous version of the paper was titled: Open-Universe Assistance Games
♻ ☆ Multimodal Fact-Level Attribution for Verifiable Reasoning ICML 2026
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
comment: Accepted to ICML 2026. Code and data are available at https://github.com/meetdavidwan/murgat
♻ ☆ The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
♻ ☆ Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation ICML'26
Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on various benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
comment: Accepted by ICML'26
♻ ☆ Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production
Within the landscape of inference-time scaling methods for foundation models, a width-based approach to scaling -- which involves the insertion of tokens in the input stream to delay model responses -- offers a unique advantage by increasing model expressivity while remaining highly parallelizable at both training and inference. The existing literature on training models to utilize tokens relies on the standard cross-entropy objective in which the model output is read out and evaluated only at the final step of a pause sequence. This approach provides no mechanism for the model to regulate its own processing or to signal readiness to respond, treating the additional compute steps as a static barrier rather than a resource to be used adaptively. We propose a supervised loss, Catch Your Breath (CYB), framed as a sequential-decision problem, that trains a model to dynamically and autonomously scale the number of compute steps used for each input token. The model indicates the need for additional compute steps by emitting a special output, delaying its response via a pause. The model can abstain multiple times to obtain longer delays. Our experiments demonstrate that CYB significantly outperforms standard cross-entropy when introduced either in pretraining or fine-tuning, reducing perplexity and enhancing downstream accuracy with no additional computational or memory cost.
♻ ☆ How Does A Text Preprocessing Pipeline Affect Ontology Matching?
The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in the mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Tokenisation and Normalisation (categorised as Phase 1 text preprocessing) are more effective than Stop Words Removal and Stemming/Lemmatisation (categorised as Phase 2 text preprocessing). We propose two novel approaches to repair unwanted false mappings that occur in Phase 2 text preprocessing. One is a pre hoc logic-based repair approach used before text preprocessing, employing an ontology-specific check to find common words that cause false mappings. The other repair approach is the post hoc large language model (LLM)-based approach, used after text preprocessing, which utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings. The experimental results indicate that these two approaches can significantly improve the matching correctness and the overall matching performance.
comment: 14 pages, 16 figures, 3 tables
♻ ☆ Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
comment: Work in progress (9 pages manuscript, 3 pages references, 16 pages appendix)
♻ ☆ DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, BERTScore, Realism Score, and human evaluation metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model when evaluated under equivalent conditions, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge. We publicly release our code(https://github.com/Hasebul/DeEscalWild-Benchmark-Framework) and dataset(https://doi.org/10.7910/DVN/CWMCZI).
comment: 20 pages
♻ ☆ Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Self-play has recently emerged as a promising paradigm for post-training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., a question), which it then addresses itself by producing a task output (e.g., an answer). A reward model evaluates the output, and the rewards are used to train the LLM, typically via Reinforcement Learning (RL). A key benefit of self-play for post-training LLMs is its minimal supervision costs: self-play avoids the need for high-quality input-output pairs traditionally constructed by humans or expensive proprietary models. Existing work, however, explores self-play only for verifiable tasks, such as math and coding, for which objective ground truth is available and easily checkable. In this paper, we seek to extend self-play to more realistic open-ended tasks. We propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics along with each input-output pair. The rubric is used to evaluate outputs and train the model. Crucially, we ground the framework on a content-rich pretraining corpus to (1) enable an exploitable generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both the pretrained base model and instruction-tuned model on multiple tasks ranging from long-form healthcare QA to creative writing and instruction following.
♻ ☆ Leviathan: Decoupling Input and Output Representations in Language Models
Modern language models use a single matrix for input embedding and output projection. This couples two distinct objectives: token representation and discrimination over a vocabulary. This work introduces Leviathan, a Transformer architecture that replaces the input embedding matrix with learned embedding vectorization (LEV), a compact continuous mapping from token indices to embeddings. Leviathan's output head remains untied for a parameter increase of as low as 0.2%. Under controlled comparisons with identical Transformer backbones, Leviathan consistently improves language modeling performance over standard tied-embedding baselines across a 200M-1.2B parameter regime on The Pile with gains that grow during training. At 1.2B scale, Leviathan reduces validation perplexity by 9%, requires $2.1\times$ fewer training tokens to reach the tied baseline's final loss, and improves on all six downstream benchmarks evaluated, including a 30% reduction in LAMBADA perplexity. Frequency-stratified analysis reveals gains to be concentrated in rare tokens, where continuous parameterization reduces perplexity by 81%, falling to near zero for the most frequent.
♻ ☆ What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network
Moltbook is the first large-scale social network built for autonomous AI agent-to-agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent's input and showed that output is mainly determined by agent identity files, behavioural instructions, and context-window structure. We formalised these findings in the Architecture-Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent's context-window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short-horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.
comment: 56 pages
♻ ☆ Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
How reliably can a small language model estimate its own correctness? The answer determines whether local-to-cloud routing-escalating queries a cheap local model cannot handle-can work without supervised training data. As inference costs dominate large language model (LLM) deployment budgets, routing most queries to a cheap local model while reserving expensive cloud calls for hard cases is an increasingly common cost-control strategy. We compare zero-shot confidence signals against RouteLLM-style supervised baselines across three 7-8B model families and two datasets (1,000 and 500 queries per model, respectively). Average token log-probability, which requires no training data, matches or exceeds supervised baselines in-distribution (Area Under the Receiver Operating Characteristic curve (AUROC) 0.650-0.714 vs. 0.644-0.676) and substantially outperforms them out-of-distribution (0.717-0.833 vs. 0.512-0.564), because it measures a property of the model's generation rather than the query distribution. This paper further proposes retrieval-conditional self-assessment, a pre-generation signal that selectively injects retrieved knowledge when similarity is high, improving over bare self-assessment by up to +0.069 AUROC at 3-10x lower latency than log-probability. A supervised baseline trained on 1,000 labeled examples never exceeds the zero-shot signal. We release all code, data, and experiment logs.
♻ ☆ Screening Is Enough
A core limitation of standard softmax attention is that it does not provide an independently interpretable measure of query--key relevance: attention scores are unbounded, while attention weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected, and some attention mass is assigned even when no key is genuinely relevant. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening computes bounded query--key similarities and applies an explicit threshold, discarding irrelevant keys and aggregating the remaining keys without global competition. Across experiments, Multiscreen achieves comparable validation loss with roughly 30\% fewer parameters than a Transformer baseline and remains stable at substantially larger learning rates. It maintains stable long-context perplexity beyond the training context and shows little degradation in retrieval performance as context length increases. Finally, Multiscreen achieves lower full-context forward-pass latency at long context lengths.
comment: 36 pages, 27 figures. Revised version with retuned Transformer baselines, additional experiments, ablations, and appendix analyses
♻ ☆ MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
♻ ☆ Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
Hallucination in large language models (LLMs) during long-form generation remains difficult to address under existing reinforcement learning from human feedback (RLHF) frameworks, as their preference rewards often overlook the model's own knowledge boundaries. In this paper, we propose the $\textbf{K}$nowledge-$\textbf{L}$evel $\textbf{C}$onsistency Reinforcement Learning $\textbf{F}$ramework ($\textbf{KLCF}$), which re-examines this problem from a distribution alignment perspective. KLCF formalizes long-form factuality as a bidirectional distribution matching objective between the policy model's expressed knowledge distribution and the base model's parametric knowledge distribution: under the constraint that generation must not exceed the support set of the base knowledge, the objective maximizes coverage of high-probability facts, thereby jointly optimizing precision and recall. To achieve this, we design a Dual-Fact Alignment mechanism that approximates the recall term using a factual checklist constructed by sampling from the base model, and constrains hallucinations with a lightweight truthfulness reward model. Both components are jointly optimized and require no external retrieval throughout training. Experimental results demonstrate that KLCF consistently improves factuality metrics across multiple long-form benchmarks and model scales, effectively alleviating hallucination and over-conservatism while maintaining efficiency and scalability.
comment: 32 pages
♻ ☆ Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made. Yet standard confidence intervals ignore variability from judge model choice, model temperature, and prompt phrasing, producing under-coverage that worsens with more data. The omitted variance can shift results enough to reverse conclusions \citep{baumann2025llmhacking, huang2026dropping}; pipelines that fail to average over it leave the surface that ``benchmark hacking'' exploits \citep{singh2025leaderboard}. This paper decomposes LLM pipeline uncertainty into its sources, distinguishes variance that shrinks with more data from sensitivity to researcher design choices, and uses design-study projections to reduce total evaluation error (TEE). Across the demonstrations, naive standard errors are 40 - 60\% smaller than the TEE-corrected SE. Using Chatbot Arena data, we show naive 95\% CI coverage drops as $n$ grows while TEE-corrected coverage holds at 95\%, and TEE-guided pipelines restrict the benchmark gaming surface from 56 to 32 Elo ($K=27$), below the human-leaderboard baseline. We show further that a small pilot recovers honest CIs and projects which design changes most improve precision. Acting on those projections halves MMLU estimation error against the answer key at equivalent cost, and raises per-match agreement with human votes by 7.9 percentage points on Chatbot Arena.
♻ ☆ RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners ACL 2026
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families-Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B)-RSAT improves faithfulness 3.7$\times$ over SFT alone (0.224$\rightarrow$0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
comment: 8 pages, 8 tables, 9 figures, and a 3-page Appendix. Accepted at the SURGeLLM Workshop at ACL 2026 and will be included in the proceedings
♻ ☆ Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation KR'26
Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
comment: Accepted at KR'26 In The Wild Track. Camera Ready with additional supplementary materials
♻ ☆ SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents
LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approaches such as LongLLMLingua have emerged to tackle this challenge, they typically rely on fixed metrics such as PPL, ignoring the task-specific nature of code understanding. As a result, they frequently disrupt syntactic and logical structure and fail to retain critical implementation details. In this paper, we propose SWE-Pruner, a self-adaptive context pruning framework tailored for coding agents. Drawing inspiration from how human programmers "selectively skim" source code during development and debugging, SWE-Pruner performs task-aware adaptive pruning for long contexts. Given the current task, the agent formulates an explicit goal (e.g., "focus on error handling") as a hint to guide the pruning targets. A lightweight neural skimmer (0.6B parameters) is trained to dynamically select relevant lines from the surrounding context given the goal. Evaluations across four benchmarks and multiple models validate SWE-Pruner's effectiveness in various scenarios, achieving 23-54% token reduction on agent tasks like SWE-Bench Verified while even improving success rates, and up to 14.84x compression on single-turn tasks like LongCodeQA with minimal performance impact.
comment: Code available at https://github.com/Ayanami1314/swe-pruner
♻ ☆ From Documents to Spans: Scalable Supervision for Evidence-Based ICD Coding with LLMs
International Classification of Diseases (ICD) coding assigns diagnosis codes to clinical documents and is essential for healthcare billing and clinical analysis. Reliable coding requires that each predicted code be supported by explicit textual evidence. However, existing public datasets provide only code labels, without evidence annotations, limiting models' ability to learn evidence-grounded predictions. In this work, we argue that dense, document-level evidence annotation is not always necessary for learning evidence-based coding. Instead, models can learn code-specific evidence patterns from local spans and use these patterns to support document-level evidence-based coding. Based on this insight, we propose Span-Centric Learning (SCL), a training framework that strengthens LLMs' coding ability at the span level and transfers this capability to full clinical documents. Specifically, we use a small set of annotated documents to supervise evidence recognition, aggregation, and code assignment, while leveraging a large collection of lightweight evidence spans to reinforce span-level reasoning. Due to their compactness, span annotations are scalable and can be further augmented through synthesis. Under the same Llama3.1-8B backbone, our approach achieves an 8.2-point improvement in macro-F1 at only 20% of the training cost of standard SFT, and provides explicit supporting evidence for each predicted code, enabling human auditing and revision.
♻ ☆ ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
Recent studies have begun to explore proactive large language model (LLM) agents that provide unobtrusive assistance by automatically leveraging contextual information, such as in code editing and in-app suggestions. However, most focus on short, task-specific episodes or on-screen contexts, rather than continuously perceiving and assisting users throughout daily life. Enabling such in-the-wild assistance requires continuous sensing of users' surroundings, which can incur substantial system overhead. In this work, we propose ProAgent, an end-to-end proactive agent system that harnesses on-demand sensory contexts to provide in-the-wild assistance. ProAgent first employs on-demand tiered perception to continuously sense users' surroundings by integrating low-cost contextual cues with richer perception on demand, and uses proactive-oriented context extraction to derive hierarchical contexts integrating both sensory contexts and human preferences. ProAgent then employs a context-aware proactive reasoner to infer user needs and invokes external tools to deliver proactive assistance. We implement ProAgent on AR glasses and evaluate it on a public dataset and a real-world dataset. Results demonstrate that ProAgent achieves up to 27.7% higher proactive prediction accuracy and 20.5% lower false detection than state-of-the-art baselines. A user study with 20 participants shows that 85% were satisfied with ProAgent and willing to use it in daily life.
♻ ☆ FIT to Forget: Robust Continual Unlearning for Large Language Models
While large language models (LLMs) exhibit remarkable capabilities, they increasingly face demands to unlearn memorized privacy-sensitive, copyrighted, or harmful content. Existing unlearning methods primarily focus on \emph{single-shot} scenarios, whereas real-world deletion requests arrive \emph{continually}. Naïvely applying these methods to sequential requests leads to severe utility degradation and catastrophic forgetting. To address this, we propose \fit, a robust continual unlearning framework to process high-volume sequential deletion streams while resisting both catastrophic forgetting and post-unlearning recovery. \fit stabilizes sequential updates through three synergistic mechanisms: redundancy \underline{F}iltering, \underline{I}mportance-aware adaptive algorithm selection, and \underline{T}argeted layer attribution. Furthermore, to facilitate rigorous evaluation, we introduce \textbf{PCH}, a unified benchmark encompassing \textbf{P}ersonal, \textbf{C}opyrighted, and \textbf{H}armful content, alongside two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), to systematically quantify forgetting-utility trade-offs. Extensive experiments across five LLMs (up to 14B parameters) demonstrate that \fit consistently achieves state-of-the-art unlearning efficacy and utility preservation. Notably, even after hundreds of sequential requests, \fit preserves strong downstream (\eg, GSM8K, MMLU) performance and exhibits superior resilience against relearning and quantization recovery attacks.
comment: 26 Pages
♻ ☆ MetaKE: Meta-Learning for Knowledge Editing Toward a Better Accuracy-Editability Trade-off
Existing locate-then-edit Knowledge Editing (KE) methods typically decompose editing into two stages: upstream target representation optimization and downstream constrained parameter optimization. The optimization across the two stages is disconnected: upstream applies uniform regularization without observing downstream realization of the planned residual, hindering a refined accuracy-editability trade-off. Since this realization is request-specific and depends on downstream constraints, uniform regularization can over-shrink high-association requests, causing insufficient editing, while it can under-regularize low-association requests, producing over-large planned residuals that reduce downstream editability. To bridge this disconnect, we propose MetaKE (Meta-learning for Knowledge Editing), a new framework that unifies upstream and downstream stages into a bi-level optimization problem. The inner level optimizes parameter updates for the target representation, while the outer level optimizes representation using feedback from downstream constraints, achieving a better semantic accuracy-editability trade-off. To avoid costly multi-layer backpropagation, we introduce a Structural Gradient Proxy to approximate and propagate this feedback. Extensive experiments show that MetaKE outperforms strong baselines, offering a new perspective on KE.
comment: 37 pages, 9 figures
♻ ☆ STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning
Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model STEER-4B rivals 7B-scale baselines on video understanding tasks with half the input frames Code and data will be released.
♻ ☆ Latent Abstraction for Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose \textbf{LAnR} (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated \texttt{[PRED]} token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.
♻ ☆ HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding ACL 2026
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs, remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions, which achieves 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
comment: Accepted to ACL 2026 Main
♻ ☆ Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
♻ ☆ DialectLLM: A Dialect-Aware Dialog[ue] Generation Framework Beyond Standard American English
More than 80% of the 1.6B English speakers do not use Standard American English (SAE), yet LLMs often fail to correctly identify non-SAE dialects and generate stereotyped responses for their speakers. We introduce DialectLLM, the first large-scale framework for generating high-quality multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features. DialectLLM produces a dialect-parallel dialog dataset spanning nine English dialects. Partnering with native linguists, we design and validate SAE-to-dialect transformation rules, ensuring authenticity. Our approach challenges the prevailing practice of applying a single morphosyntactic feature set to both user utterances and model responses, showing that models should not reproduce up to 90% of the grammatical features of a dialect. Human evaluation confirms data quality, with annotators preferring DialectLLM over prior methods in 98.8% of pairwise comparisons for dialect naturalness. We then construct DialectLLM-Bench, a dialect-parallel benchmark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for prominent dialects like Canadian English, and systematically misclassify non-SAE dialects as American or British. Beyond benchmarking, we show that DialectLLM data also serve as a scalable LLM post-training resource, suggesting a practical path toward dialect-aware conversational AI.
♻ ☆ Monotonic Reference-Free Refinement for Autoformalization
While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process at inference time for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth or existing formalizations and without human intervention. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting as different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate the proposed process enables simultaneous improvement across multiple dimensions, achieving 100.00% formal validity and a 90.27% overall score on miniF2F, and 77.96% formal validity and a 52.45% overall score on ProofNet.
comment: Preprint
♻ ☆ Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering ACL 2026
Existing long-document question answering systems typically process texts as flat sequences or use heuristic chunking, which overlook the discourse structures that naturally guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) for long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: language-universal discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Extensive experiments on four datasets demonstrate consistent improvements over existing approaches through the incorporation of discourse structure, across multiple genres and languages. Moreover, the proposed framework exhibits strong robustness across diverse document types and linguistic settings.
comment: 21 pages, 9 figures. Accepted at ACL 2026 Main conference
♻ ☆ PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning
Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their compatibility with language-based interfaces and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text question-answering dataset that bridges raw PPG waveforms and natural language through a unified question-answering (QA) formulation. PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The dataset comprises over 1 million standardized 10-second PPG segments, associated with nearly 2.5 million question-answer pairs. We further define reproducible data pipeline, training, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying language-grounded physiological inference, cross-dataset generalization, and scalable benchmarking of PPG-based multimodal models. We publicly release the dataset and code at https://huggingface.co/datasets/Manhph2211/PulseLM and https://github.com/manhph2211/PULSE-LM, respectively.
comment: PulseLM v2
♻ ☆ Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
Diffusion large language models (dLLMs) gain speed by committing multiple tokens in parallel at each denoising step, but any erroneous commitment persists as conditioning context and biases every subsequent prediction. LLaDA2.1 repairs such errors with Token-to-Token (T2T) editing, which re-examines previously unmasked tokens and overwrites them when an alternative becomes sufficiently confident. We argue that this replacement action is itself the limiting factor: under polluted context, a confident replacement can propagate the error, while under a multimodal posterior no alternative may be confident enough to trigger an edit. We propose Token-to-Mask (T2M) remasking, a training-free rule that revokes suspicious commitments by resetting them to [M] and lets the subsequent mask-filling steps re-predict them from a cleaner context. T2M improves accuracy by +13.33 points on AIME 2025 and +8.56 points on CMATH. These results suggest that, for parallel discrete generators, remasking suspect tokens rather than overwriting them is a more reliable self-correction primitive.
♻ ☆ Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model
Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground truth annotations or require advanced proprietary language models (LMs) to synthesize environments that keep fixed once created. In this work, we propose TRUSTEE, a cost-friendly method for training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls task difficulty during training. Our empirical results show that TRUSTEE outperforms baselines which require extra external resources in most cases. These confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning. We hope our proposed paradigm could democratize tool learning and inspire future research on environment scaling with limited resources.
comment: Preprint
♻ ☆ LMEB: Long-horizon Memory Embedding Benchmark
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB measure orthogonal capabilities. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that strong performance on traditional passage retrieval does not necessarily transfer to long-horizon memory retrieval. LMEB provides a standardized and reproducible framework that fills a key gap in memory embedding evaluation and supports future advances in long-term, context-dependent retrieval. LMEB is available at https://kalm-embedding.github.io/LMEB.github.io/.
comment: 35 pages, 9 figures, 23 tables
♻ ☆ JaiTTS: A Thai Voice Cloning Model
We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We test the models on short- and long-duration speech generation, which reflects many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% for short-duration tasks while performing on par with human ground truth for long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses. Our code and demo are available at https://github.com/JTS-AI-Team/JaiTTS .
♻ ☆ KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Controls ICML 2026
Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, its knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
comment: ICML 2026, Project Page: https://kore-lmm.github.io/
♻ ☆ Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose \textbf{LoPT}: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
comment: 33pages
♻ ☆ Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
LLMs are trained once, then deployed into a world that never stops changing. External memory compensates for this, but most systems manage it explicitly rather than letting it adapt on its own. Biological memory works differently: coupled multi-timescale dynamics make new associations immediately usable, strengthen what repetition confirms, and let the rest fade. We argue that external memory should follow a similar principle. In Memini, this view takes the form of an associative memory that organizes knowledge as a directed graph. Each edge carries two coupled internal variables, one fast and one slow, following the Benna-Fusi model of synaptic consolidation. From this coupling, episodic sensitivity, gradual consolidation, and selective forgetting emerge as facets of a single mechanism, reframing external memory as a learning substrate that reorganizes through its own dynamics.
comment: Preprint. 9 pages, 2 figures
♻ ☆ CAMEL: Confidence-Gated Reflection for Reward Modeling ICML 2026
Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
comment: ICML 2026
♻ ☆ Adaptive Greedy Frame Selection for Long Video Understanding
Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1~FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
♻ ☆ v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once to key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1g, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines.
♻ ☆ Exploring Cross-Client Memorization of Training Data in Large Language Models for Federated Learning ACL 2026
Federated learning (FL) enables collaborative training without raw data sharing, but still risks training data memorization. Existing FL memorization detection techniques focus on one sample at a time, underestimating more subtle risks of cross-sample memorization. In contrast, recent work on centralized learning (CL) has introduced fine-grained methods to assess memorization across all samples in training data, but these assume centralized access to data and cannot be applied directly to FL. We bridge this gap by proposing a framework that quantifies both intra- and inter-client memorization in FL using fine-grained cross-sample memorization measurement across all clients. Based on this framework, we conduct two studies: (1) measuring subtle memorization across clients and (2) examining key factors that influence memorization, including decoding strategies, prefix length, and FL algorithms. Our findings reveal that FL models do memorize client data, particularly intra-client data, more than inter-client data, with memorization influenced by training and inferencing factors.
comment: Accepted to The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
♻ ☆ Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation
Long-form question answering (LFQA) requires open-ended long-form responses that synthesize coherent, factually grounded content from multi-source evidence. This makes reinforcement learning (RL) reward design critical. The reward must be verifiable for faithful grounding and stable optimization. However, many standard rewards assume a unique target with an exact-match notion of correctness, which fits short-form QA and math but breaks in LFQA. As a result, current RAG systems still lack verifiable reward mechanisms, yielding unstable feedback signals and suboptimal optimization outcomes. We propose RioRAG, a framework for reinforced verifiable informativeness optimization. First, it defines informativeness as a measurable and externally verifiable objective for RL. Second, RioRAG uses nugget-centric verification with cross-source checks to enable self-evolution of smaller LLMs and to provide denser, action-discriminative rewards that mitigate reward sparsity and stabilize optimization. This formulation avoids handcrafted supervision for the policy model and strong teacher-model distillation, relying instead on externally verifiable feedback. Experiments on LongFact and RAGChecker show that RioRAG achieves higher factual recall and faithfulness, establishing verifiable reward modeling as a foundation for trustworthy long-form RAG. Our codes are available at https://github.com/RUCAIBox/RioRAG.
♻ ☆ Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy
Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
♻ ☆ Large Language Model Prompt Datasets: An In-depth Analysis and Insights
We compile 129 heterogeneous LLM prompt datasets (>1.22 TB, >673M instances) into a structured taxonomy and conduct a multi-level linguistic analysis (lexical, syntactic, and semantic) on seven representative corpora, surfacing systematic patterns that distinguish prompts from general text. Three downstream experiments validate practical utility: prompt filtering (F1 = 0.90), domain classification (Macro-F1 = 0.975), and prompt quality prediction (AUC = 0.792), all without invoking any additional model. A central finding is that 62-d syntactic features (POS + dependency distributions) serve as a uniquely efficient routing primitive, recovering >93% of GPU-embedding accuracy at 1.9 $\times$ lower single-request latency (3.0 ms vs. 5.7 ms) with no GPU and no corpus vocabulary. A complementary discriminative--predictive divergence shows that features most useful for routing are precisely those most negatively correlated with response quality, while lexical diversity (Cohen's $d$ = 0.71) dominates the quality signal but carries minimal routing weight, directly motivating two-stage pipeline design. Our datasets and code are available.
♻ ☆ What MLLMs Learn about When they Learn about Multimodal Reasoning
Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that exposes this assumption by operationally decomposing performance into perception, reasoning, and multimodal-specific components. Each problem is derived from a symbolic specification and accompanied by visual diagrams, text-only variants, multimodal questions, and targeted perceptual probes, enabling controlled measurement of each component. Using this decomposition, we show that common training strategies induce systematically different capability profiles that are invisible under aggregate accuracy. Reinforcement learning primarily improves perceptual grounding and robustness to diagram variation, while textual SFT yields gains through reflective reasoning. In contrast, as perception and reasoning improve, a growing fraction of remaining errors fall outside these components and are categorized as multimodal-specific. These results suggest that apparent progress in multimodal reasoning reflects shifting balances among subskills rather than uniform advancement, motivating evaluation beyond scalar accuracy.
♻ ☆ Omnilingual MT: Machine Translation for 1,600 Languages
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundreds more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and freely available.
♻ ☆ Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
comment: 36 pages
♻ ☆ How do datasets, developers, and models affect biases in a low-resourced language?: The Case of the Bengali Language
Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.
♻ ☆ GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and decision-making tasks, PRMs are required to possess capabilities for detecting process-level errors in real-world scenarios. However, existing benchmarks primarily focus on mathematical reasoning, thereby failing to comprehensively evaluate the error detection ability of PRMs across diverse reasoning scenarios. To mitigate this gap, we introduce GR-Ben, a process-level benchmark specifically designed for assessing PRM's performance across two primary reasoning domains (science and logic) and nine subdomains. We conduct extensive experiments on a diverse set of 22 models, encompassing both PRMs and LLMs, and derive two key findings: (1) In domains beyond mathematical reasoning, the error-detection ability of existing PRMs and LLMs is found to be markedly weaker by comparison.(2) In general, PRMs are less adept at identifying knowledge-based errors, whereas LLMs exhibit poorer performance in detecting computational errors. We hope GR-Ben can foster future researches on PRMs for general domains, thereby enhancing the reasoning capabilities of LLMs.
♻ ☆ Quantifying Hallucinations in Language Language Models on Medical Textbooks
Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments, the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given closed-source zero-shot prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7\% of answers (95\% CI 18.6 to 20.7) even though 98.8\% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and 2 respectively. Our findings indicate that, across all scales and architectures tested, current large language models remain unfit for unsupervised clinical deployment, and that human expert oversight is both necessary and the dominant cost driver.
comment: 8 pages, 4 figures
♻ ☆ How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
We measure how much one recurrence is worth to a looped (depth-recurrent) transformer, in equivalent unique parameters. From an iso-depth pretraining sweep across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-α} + B\,D^{-β}$ and measure a recurrence-equivalence exponent $\varphi = 0.46$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so replacing unique blocks with shared recurrences increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but incurs the training cost of a 1B non-looped one. We demonstrate the utility of $\varphi$ as a diagnostic tool on two case studies: commonly used truncated backpropagation lowers $\varphi$ to $0.38$, indicating that the loop mechanism is poorly trained under truncation, even though validation loss decreases. Conversely, hyperconnections raise $\varphi$ to $0.65$, a genuine capacity gain. Our method separates true loop improvements from training-side gains, a distinction raw validation loss cannot make.
comment: v3: substantially refined framing + minor corrections v2: added case studies on truncated-BPTT and hyperconnections
♻ ☆ Sampling from Your Language Model One Byte at a Time
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect code generation and languages such as Chinese, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. Code is available at https://github.com/SewoongLab/byte-sampler .
comment: 28 pages, 9 figures
♻ ☆ Sample-efficient LLM Optimization with Reset Replay
Recent advancements in LLM post-training, particularly through reinforcement learning and preference optimization, are key to boosting their reasoning capabilities. However, these methods often suffer from low sample efficiency and a susceptibility to primacy bias, a phenomenon where overfitting to initial experiences diminishes network plasticity and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin for enhancing sample efficiency in preference-based optimization. Its core mechanism enables high-replay training to maximize the utility of each data batch. To mitigate overfitting, LoRR orchestrates a periodic reset strategy that reuses the initial data and policy to maintain network plasticity, and further adopts a hybrid optimization objective to better exploit training data. Extensive experiments show that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO framework augmented with LoRR achieves comparable performance on challenging math tasks, rivaling many complex or computationally expensive baselines. Our findings highlight that LoRR offers a practical and sample-efficient paradigm from limited offline data, unlocking greater performance with minimal changes to existing post-training workflows.
♻ ☆ EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery
Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields \textbf{EternalMath}, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.
♻ ☆ Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal
Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output to input error, calling it rho. A value below one means the layer absorbs the error; above one means it grows. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) Errors at layer t scale downstream by the product of later rho values, predicting representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer, naive pruning shows a ~600x spread in component sensitivity. Activation-aware pruning (Wanda) shrinks this to 3-7x; the ranking reverses across architectures, so fixed importance scores do not transfer. (iii) For depth pruning, ranking layers by how far rho is from one takes two forward passes. It beats ShortGPT's Block Influence with 1.6x lower perplexity at eight layers removed, and physical deletion delivers 1.22x wall-clock speed-up. A blend of the two criteria does best (perplexity 14.2, 60.0% downstream accuracy on LLaMA-2-7B). Twelve Lean 4 norm inequalities provide machine-checked per-matrix error bounds. The contraction profile thus gives a training-free instrument for two decisions: where to compress within layers, and which to remove.
♻ ☆ CompassLLM: A Multi-Agent Approach toward Geo-Spatial Reasoning for Popular Path Query
The popular path query - identifying the most frequented routes between locations from historical trajectory data - has important applications in urban planning, navigation optimization, and travel recommendations. While traditional algorithms and machine learning approaches have achieved success in this domain, they typically require model training, parameter tuning, and retraining when accommodating data updates. As Large Language Models (LLMs) demonstrate increasing capabilities in spatial and graph-based reasoning, there is growing interest in exploring how these models can be applied to geo-spatial problems. We introduce CompassLLM, a novel multi-agent framework that intelligently leverages the reasoning capabilities of LLMs into the geo-spatial domain to solve the popular path query. CompassLLM employs its agents in a two-stage pipeline: the SEARCH stage that identifies popular paths, and a GENERATE stage that synthesizes novel paths in the absence of an existing one in the historical trajectory data. Experiments on real and synthetic datasets show that CompassLLM demonstrates superior accuracy in SEARCH and competitive performance in GENERATE while being cost-effective.
♻ ☆ Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation ICML 2026
The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
comment: ICML 2026
♻ ☆ How important is Recall for Measuring Retrieval Quality?
In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
comment: Dataset: https://huggingface.co/datasets/primer-ai/retrieval-response
♻ ☆ Token-Level Entropy Reveals Demographic Disparities in Language Models
We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures - opposite to the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024) - and Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($ΔΔ> 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hatβ= -0.041, p = .019$) and more homogeneous outputs ($\hatα= +0.024, p < .001$) than men-associated names - a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hatβ=+0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant - establishing that probing methodology is a primary determinant of which distributional structure is recovered.
comment: 9 pages
♻ ☆ Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) is an essential paradigm that enhances the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically rely on static policy optimization schemes that misalign with the model's evolving reasoning capabilities. To address this issue, we propose Adaptive Power-Mean Policy Optimization (APMPO), which comprises two main innovations: Power-Mean Policy Optimization (PMPO) and Feedback-Adaptive Clipping (FAC). Specifically, PMPO introduces a generalized power-mean objective. This enables the model to adaptively transition from the signal-amplifying behavior of the arithmetic mean to the consistency-enforcing behavior of the geometric mean. FAC adaptively adjusts clipping bounds based on real-time reward statistics to overcome the limitations of static mechanisms. Capitalizing on these innovations, APMPO improves learning dynamics and reasoning performance. Extensive experiments on nine datasets across three reasoning tasks showcase the superiority of APMPO over state-of-the-art RLVR-based baselines. For instance, APMPO boosts the average Pass@1 score on mathematical reasoning benchmarks by 3.0 points compared to GRPO when using Qwen2.5-3B-Instruct.
comment: Accepted to ACL 2026 (Findings)
♻ ☆ Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs ACL 2026
Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.
comment: Accepted by ACL 2026
♻ ☆ Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains challenging as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on modern hardware accelerators, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at https://github.com/mit-han-lab/fouroversix.
comment: 10 pages, 4 figures
♻ ☆ Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs
Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate self-consistency is difficult to justify. We recommend reserving multi-path sampling for problems that demonstrably exceed a model's single-pass reliability.
comment: 7 pages, 3 figures
♻ ☆ ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement ACL2026
Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose \textbf{ImCoref-CeS}, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (\textbf{ImCoref}) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
comment: Accepted by ACL2026 main
♻ ☆ ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.
comment: Accepted at AfricaNLP 2024
♻ ☆ LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning VLDB2026
Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; Cache-and-Reuse Mechanism}, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.
comment: Accepted by VLDB2026
♻ ☆ Every Step Counts: Step-Level Credit Assignment for Tool-Integrated Text-to-SQL
Tool-integrated Text-to-SQL parsing has emerged as a promising paradigm, framing SQL generation as a sequential decision-making process interleaved with tool execution. However, existing reinforcement learning approaches mainly rely on coarse-grained outcome supervision, resulting in a fundamental credit assignment problem: models receive the same reward for any trajectory that yields the correct answer, even when intermediate steps are redundant, inefficient, or erroneous. Consequently, models are encouraged to explore suboptimal reasoning spaces, limiting both efficiency and generalization. To address this problem, we propose FineStep, a novel framework for step-level credit assignment in tool-augmented Text-to-SQL. First, we introduce a reward design with independent process rewards to alleviate the signal sparsity of outcome supervision. Next, we present a step-level credit assignment mechanism to precisely quantify the value of each reasoning step. Finally, we develop a policy optimization method based on step-level advantages for efficient updates. Extensive experiments on BIRD benchmarks show that FineStep achieves state-of-the-art performance and reduces redundant tool interactions, with a 3.25% average EX gain over GRPO at the 4B scale.
♻ ☆ Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call identifies the mean latent success probability; two labeled calls identify its second moment and hence the same-example correctness correlation that separates stable errors from recoverable call-level randomness. From these two moments, every fixed majority-vote budget has a sharp distribution-free two-call interval. The key technical reduction is that the infinite-dimensional moment problem has three-atom extremizers and quadratic dual certificates for every finite budget, so the bounds are exact rather than discretized or parametric. The first useful budget, three votes, has a closed form, width at most $1/8$, and a certified-improvement criterion. The infinite-vote endpoint is the limit of majority voting as the number of calls tends to infinity; it is also sharply bounded, but remains threshold-sensitive because it depends on latent mass around $q=1/2$. We add maximum-entropy and Latent-difficulty Gaussian-probit point completions, and experiments on LLM calls over QNLI and QQP show that empirical three- and five-vote accuracies are contained in the projected two-call regions while temperature changes and randomized model mixtures can create voting gains not ordered by one-call accuracy.
♻ ☆ Residual-Mass Accounting for Partial-KV Decoding
We study a controlled partial-KV decoding setting in which exact unnormalized softmax contributions are computed for sink/tail anchors and a retrieved token set, while the remaining prefill tokens are represented by a residual estimate. We focus on the accounting rule after the query-dependent exact support has been selected, and use exhaustive Top-K only as an oracle selector, not as a deployable retrieval system. The proposed rule leaves the backbone language model and the exact-branch KV tensors unchanged. It builds fixed-size summary states $(S,u)$ from learned positive feature maps $φ$, subtracts retrieved-token feature contributions to keep the exact and residual sets non-overlapping, and merges the estimated residual numerator and denominator with the exact branch under one normalization. At a 1% exact-support budget, our residual-completion method improves over the selection-only Top-K baseline on RULER and BABILong across frozen 1B and 3B Llama-3.2-Instruct backbones at all reported context lengths. In the 0.5-4% exact-support budget sweeps, this trend largely persists. On LongBench, summarization results are mostly favorable, while multi-document QA is mixed. Attention-output diagnostics support retrieved-token subtraction as the partition-consistent accounting rule, while indicating that the main remaining error is imperfect learned-$φ$ approximation of the unretrieved residual mass.
♻ ☆ Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation ICLR 2026
Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
comment: Accepted to ICLR 2026
♻ ☆ Optimizing Language Models for Crosslingual Knowledge Consistency ICML 2026
Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
comment: ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL
♻ ☆ A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method
Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular descriptions that preserve complete structural details at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structural XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule--description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6$%. The proposed annotation framework is readily beneficial to broader chemical tasks that rely on structural descriptions, with the resulting dataset providing a reliable foundation for molecule--language alignment. The source code and dataset are hosted at https://github.com/TheLuoFengLab/MolLangData and https://huggingface.co/datasets/ChemFM/MolLangData, respectively.
♻ ☆ JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a systematic empirical study of prompt-induced decision instability across multiple evaluation tasks and judge architectures. To facilitate this analysis, we release JudgeSense, a benchmark comprising hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, drawn from established NLP benchmarks and accompanied by comprehensive decision logs. The benchmark enables the measurement of judge stability across equivalent prompts, allowing researchers to assess whether stability correlates with model scale or instruction-tuning, and to identify which tasks are most sensitive to prompt wording. Our evaluation reveals that coherence remains the primary task for distinguishing judge behavior, while factuality judgments demonstrate high stability under standard conditions. Pairwise evaluation tasks consistently exhibit position bias. Crucially, we find that model scale is not a reliable proxy for consistency; notably, as an interesting result in our analysis, the largest and newest models are not the most consistent.
comment: 20 pages, 2 figures, 1 table. Code: https://github.com/rohithreddybc/judgeSense. Dataset (JudgeSense Benchmark): https://huggingface.co/datasets/Rohithreddybc/judgesense-benchmark
♻ ☆ Retrieval Heads are Dynamic ACL 2026
Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences on the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
comment: Accepted at ACL 2026
♻ ☆ Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
comment: Appeared at ICWSM 2026
♻ ☆ BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection SemEval-2026
The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine tuning with Direct Preference Optimization (DPO) refinement. We fine tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802.
comment: Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), to be held in conjunction with ACL 2026
♻ ☆ Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models
Vision-language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain ad-hoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation. Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.
♻ ☆ Neural Neural Scaling Laws
Neural scaling laws predict how language model performance improves with increased training inputs. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation loss suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without the limitations inherent in assuming a specific functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 1.99% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 44% reduction compared to logistic scaling laws (3.56% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, architectures, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling directly from data outperforms parametric alternatives.
♻ ☆ Statistical Patterns in the Equations of Physics and the Emergence of a Meta-Law of Nature
Physics seeks to uncover the laws of Nature and express them through mathematical equations. Despite the vast diversity of natural phenomena, physical equations exhibit structural regularities that set them apart from arbitrary mathematical expressions. While principles such as dimensional analysis have long guided the formulation of physical models, the exploration of more subtle statistical patterns within the equations of physics remains an open question. Here, by analysing four corpora of physics equations and applying advanced implicit-likelihood techniques, we find that the frequency of mathematical operators follows an exponential decay law, in contrast to Zipf's power law for word frequencies in natural languages. This reveals a statistical meta-law of physics, possibly reflecting a combination of communication efficiency and constraints imposed by Nature itself. The meta-law offers practical benefits for symbolic regression by drastically narrowing down the space of physically plausible expressions. More broadly, it may inform the development of language models that can generate coherent mathematical representations, advancing the automation of physical law discovery.
comment: 11 pages, 5 figures, 2 table
♻ ☆ SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with structural erosion rising in 77% of trajectories and verbosity in 75.5%. Compared to 473 open-source Python repositories, agent code is 2.3x more verbose and 2.0x more eroded, and the human repositories degrade less often and by smaller margins across their git histories. Explicit quality guidance reduces initial verbosity and erosion by up to a third, without affecting degradation rates. SlopCodeBench provides the first measurement of code degradation under iterative extension, revealing that agents pass checkpoints while producing code that erodes and bloats with each turn.
comment: Code and Leaderboards are located at https://www.scbench.ai
♻ ☆ UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types
RL alignment methods, including RLHF and DPO, are primarily based on pairwise preference data. Although scalar or score-based feedback has been collected in some settings, it is rarely used directly, and preference magnitude information is typically ignored. Furthermore, current alignment frameworks offer limited capability for unifying heterogeneous supervision signals, making it difficult to jointly leverage diverse data types within a single training paradigm. This limitation constrains the richness and scalability of the alignment process. To address this gap, we propose a \textbf{UN}ified \textbf{A}lignment (UNA) framework capable of training across different types of feedback, including binary, pairwise, and score-based, through a generalized implicit reward function. The reward function is theoretically proved to be the optimal policy by the log sum inequality. Extensive experiments on classical benchmarks consistently demonstrate the advantage of the proposed unified framework with typical LLM base models.
♻ ☆ VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.
♻ ☆ Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
♻ ☆ UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function
By pretraining on trillions of tokens, an LLM gains the capability of text generation. However, to enhance its utility and reduce potential harm, SFT and alignment are applied sequentially to the pretrained model. Because SFT and alignment have different objectives and underlying processes, performance on certain tasks can decline. To address this, we seamlessly introduce Unified Fine-Tuning (UFT), which integrates SFT and alignment into a single training stage using the same objective and loss functions through an implicit reward function. Our experimental results demonstrate that UFT outperforms SFT on instruction-tuning data alone. Moreover, when combining instruction-tuning data with alignment data, UFT effectively prevents the degradation on some tasks across these two stages and shows a clear advantage over sequentially applying SFT and alignment. This is evident in the significant improvements observed in the \textbf{ifeval} task for instruction-following and the \textbf{truthful} task for factuality. The proposed general fine-tuning framework UFT establishes an effective and efficient paradigm for LLM post-training.
♻ ☆ Multilingual Safety Alignment via Self-Distillation
Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.
♻ ☆ Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses--human expert review--is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
comment: Accepted for publication in npj Digital Medicine
♻ ☆ Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
♻ ☆ Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning
Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.
♻ ☆ Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLMs Overestimate Their Own Effectiveness
Partisan news media erode cross-partisan trust, but large language models (LLMs) offer a potential means of debiasing such content at scale. Across two pre-registered experiments, we tested whether LLM-generated debiasing of liberal news headlines could improve conservative readers' trust-relevant judgments. Study 1 found that subtle lexical debiasing (replacing emotive words with more moderate synonyms) had no effect on any outcome. Study 2 found that a more substantive reframing intervention significantly increased conservatives' perceived trustworthiness, completeness, and willingness to engage with liberal news headlines, without producing a backfire effect among a sample of liberals. In Study 1, the intervention produced robust effects among LLM-simulated silicon participants, whereas it had no impact on human readers. In Study 2, the intervention's effects among silicon participants aligned directionally with human responses but were significantly larger in magnitude for some outcomes. Moderation analyses revealed that the model's implicit theory of who responds to debiasing diverged from the psychological profile that actually predicted human responsiveness. These findings demonstrate that LLM-based debiasing can improve cross-partisan receptivity when targeting ideological framing rather than surface-level language, but that current models lack both the quantitative accuracy and qualitative psychological fidelity to evaluate their own interventions without human oversight.
♻ ☆ MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm employed by Direct Preference Optimization (DPO) and its variants of treating preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. Additionally, MaPPO introduces no additional hyperparameters, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin for DPO variants, including widely used SimPO, IPO and CPO, and produce consistent improvements. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
♻ ☆ Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation ACL 2026
Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation metrics: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing facts, (2) a Q&A-based metric that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end approach that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end metric compared to more complex metrics, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
comment: ACL 2026 Findings
♻ ☆ S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models ACL 2026
Recent advances in large language models (LLMs) have fundamentally reshaped speech-to-speech (S2S) systems, enabling increasingly natural spoken interaction. However, existing benchmarks still rely heavily on text-based evaluation and largely ignore paralinguistic cues such as prosody, emotion, and speaker traits, which are central to expressive and human-like communication. We introduce S2S-Arena, a speech-native benchmark for evaluating instruction-following S2S models with explicit assessment of both semantic understanding and paralinguistic expression. S2S-Arena features a four-level interaction protocol that systematically probes models under increasing paralinguistic complexity, a two-stage data construction pipeline that produces 1,243 speech samples spanning 100+ real-world tasks, and an arena-style evaluation framework that enables reference-free, pairwise comparison directly in the speech modality. Benchmarking 10 state-of-the-art S2S systems over 1,000+ comparisons reveals substantial performance gaps (especially under complex paralinguistic demands) between current academic and industrial systems. Our analysis further identifies key design factors governing expressive instruction following, providing actionable insights for building more natural, robust, and human-aligned speech agents.
comment: Accepted by ACL 2026 main
Machine Learning 301
☆ ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation SIGGRAPH 2026
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
comment: SIGGRAPH 2026
☆ UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
☆ Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter's reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
☆ Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is misleading. Nearly 2/3 of the decisive votes cancel out, and even the top 50 models according to the global BT ranking are statistically indistinguishable (pairwise win probabilities are at most 0.53 within the top 50 models). We trace this failure to strong, structured heterogeneity of opinions across language, task, and time. Moreover, we find an important characteristic - *language* plays a key role. Grouping by language (and families) increases the agreement of votes massively, resulting in two orders of magnitude higher spread in the ELO scores (i.e., very consistent rankings). What appears as global noise is in fact a mixture of coherent but conflicting subpopulations. To address such heterogeneity in supervised machine learning, we introduce the framework of $(λ, ν)$-portfolios, which are small sets of models that achieve a prediction error at most $λ$, "covering" at least a $ν$ fraction of users. We formulate this as a variant of the set cover problem and provide guarantees using the VC dimension of the underlying set system. On the Arena data, our algorithms recover just 5 distinct BT rankings that cover over 96% of votes at a modest $λ$, compared to the 21% coverage by the global ranking. We also provide a portfolio of 6 LLMs that cover twice as many votes as the top-6 LLMs from a global ranking. We further construct portfolios for a classification problem on the COMPAS dataset using an ensemble of fairness-regularized classification models and show that these portfolios can be used to detect blind spots in the data, which might be of independent interest to policymakers.
☆ Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.
☆ When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values between 0.89 and 1.00, target identity is the dominant variance component ($η^2 \approx 0.52$), and severity profiles stabilize by ten reruns. Applying the same chain to Petri shows that it admits both tools. The substantial differences arise upstream of the chain, in claim-contract enforcement and deployment fit. A Norwegian public-sector procurement case comparing Borealis and Gemma 3 demonstrates the resulting evidence in practice: the safer model depends on scenario category and risk measure. Consequently, scores, matched deltas, critical rates, uncertainty, and the auditor and judge used must be reported together rather than collapsed into a single ranking.
comment: SimpleAudit Repository: https://github.com/kelkalot/simpleaudit
☆ Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce \textit{SuperIntelligent Retrieval Agent} (SIRA), which defines \emph{superintelligence} in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and document-frequency statistics as a tool call to filter proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA achieves the significantly superior performance outperforming dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
☆ Inductive Venn-Abers and related regressors
Venn-Abers predictors are probabilistic predictors that enjoy appealing properties of validity, but their major limitation is that they are applicable only to the case of binary classification, with a recent extension to bounded regression. We generalize them to the case of unbounded regression, which requires adding an element of conformal prediction. In our simulation and empirical studies we investigate the predictive efficiency of point regressors derived from Venn-Abers regressors and argue that they somewhat improve the predictive efficiency of standard regressors for larger training sets.
comment: 33 pages
☆ Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction
Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
comment: Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request
☆ Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7, 402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
comment: Code: https://github.com/lihongzhao99/MMDG_Benchmark
☆ Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
*Concept-based explanations* offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on *formal abductive and contrastive explanations* computes the minimal set of input features causally relevant for model outcomes but only considers low-level features such as pixels. Merging these two threads, in this work, we propose the notion of *concept-based abductive and contrastive explanations* that capture the minimal sets of high-level concepts causally relevant for model outcomes. We then present a family of algorithms that enumerate all minimal explanations while using *concept erasure* procedures to establish causal relationships. By appropriately aggregating such explanations, we are not only able to understand model predictions on individual images but also on collections of images where the model exhibits a user-specified, common *behavior*. We evaluate our approach on multiple models, datasets, and behaviors, and demonstrate its effectiveness in computing helpful, user-friendly explanations.
☆ Recursive Agent Optimization
We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model's context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
☆ Crafting Reversible SFT Behaviors in Large Language Models
Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply *causal necessity*, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a *carrier*, while remaining controllable at inference time without weight modification? We propose (a) **Loss-Constrained Dual Descent (LCDD)**, which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) **SFT-Eraser**, a soft prompt optimized via activation matching on extracted carrier channels, to reverse the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.
☆ Hybrid Quantum-Classical GANs for the Generation of Adversarial Network Flows
Classical generative adversarial networks (GANs) have been applied to generate adversarial network traffic capable of attacking intrusion detection systems, but they suffer from shortcomings such as the need for large amounts of high-dimensional datasets, mode collapse, and high computational overhead. In this work, we propose a hybrid quantum-classical GAN (QC-GAN) framework where a variational quantum generator is used to generate synthetic network traffic flows mimicking malicious traffic using latent representations. Instead of sampling classical noise vectors, we encode the latent vector (the hidden features) as a quantum state, which is the basis for claiming more expressive latent representations and reducing computational overhead. A classical discriminator will be trained on real-world datasets (UNSW-NB15) and the proposed QC-GAN-generated fake network flows. In this configuration, the generator aims to minimize the discriminator's ability to distinguish real from fake traffic, while the discriminator aims to maximize its classification accuracy, in an iterative manner. In our attack model, we assume that the attacker is a state actor with access to limited quantum computing power, whereas the discriminator is chosen to be classical, as will likely be the case for most end users and organizations. We test the generated flows using classical intrusion detection system (IDS) models, such as a random forest classifier and a convolutional neural network-based classifier, for their ability to bypass the detection process. This work aims to highlight the possibilities of quantum machine learning as a means of generating advanced attack flows and stress testing classical IDS. Lastly, we further evaluate how hardware-based noise affects these attacks to offer a new perspective on IDS, highlighting the need for a quantum resilient defense system.
comment: 14 pages
☆ LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compression schemes based on scalar quantization or resolution reduction are broadly applicable but fail to exploit inherent signal redundancies, resulting in suboptimal rate-distortion performance. Recent generative neural codecs, or tokenizers, model complex signal dependencies but are often over-parameterized, data-hungry, and modality-specific, making them impractical for resource-constrained environments. We introduce a Lightweight, Versatile, and Asymmetric neural codec architecture (LiVeAction), that addresses these limitations through two key ideas. (1) To reduce the complexity of the encoder to meet the resource constraints of the execution environments, we impose an FFT-like structure and reduce the overall size and depth of the neural-network-based analysis transform. (2) To allow arbitrary signal modalities and simplify training, we replace adversarial and perceptual losses with a variance-based rate penalty. Our design produces codecs that deliver superior rate-distortion performance compared to state-of-the-art generative tokenizers, while remaining practical for deployment on low-power sensors. We release our code, experiments, and python library at https://github.com/UT-SysML/liveaction .
comment: DCC 2026
☆ PianoCoRe: Combined and Refined Piano MIDI Dataset
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music. PianoCoRe is released in tiered subsets to support different applications: from large-scale analysis and pre-training (PianoCoRe-C and deduplicated PianoCoRe-B) to expressive performance modeling with note-level score alignment (PianoCoRe-A/A*). The note-aligned subset, PianoCoRe-A, provides the largest open-source collection of 157,207 performances aligned to 1,591 scores to date. In addition to the dataset, the contributions are: (1) a MIDI quality classifier for detecting corrupted and score-like transcriptions and (2) RAScoP, an alignment refinement pipeline that cleans temporal alignment errors and interpolates missing notes. The analysis shows that the refinement reduces temporal noise and eliminates tempo outliers. Moreover, an expressive performance rendering model trained on PianoCoRe demonstrates improved robustness to unseen pieces compared to models trained on raw or smaller datasets. PianoCoRe provides a ready-to-use foundation for the next generation of expressive piano performance research.
comment: Published in TISMIR. Project repository: https://github.com/ilya16/PianoCoRe
☆ When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.
comment: Code is available at https://github.com/Dingzhen230/SignSGD_Outperforms_SGD
☆ Online Bayesian Calibration under Gradual and Abrupt System Changes
Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent parameters and a discrepancy function to model bias, but suffers from parameter--discrepancy confounding and is typically formulated as an offline procedure under a stationary data-generating assumption. These limitations are restrictive in modern digital twin applications, where systems evolve over time and may exhibit gradual drift and abrupt regime shifts. While data assimilation methods enable sequential updates, they generally do not explicitly model systematic bias and are less effective under abrupt changes. We propose Bayesian Recursive Projected Calibration (BRPC), an online Bayesian calibration framework for streaming data under simulator mismatch and nonstationarity. BRPC extends projected calibration to the online setting by separating a discrepancy-free particle update for calibration parameters from a conditional Gaussian process update for discrepancy, preserving identifiability while enabling bias-aware adaptation under gradual system evolution. To handle abrupt changes, BRPC is integrated with restart mechanisms that detect regime shifts and reset the calibration process. We establish theoretical guarantees for both components, including tracking performance under gradual evolution and false-alarm and detection behavior for restart mechanisms. Empirical studies on synthetic and plant-simulation benchmarks show that BRPC improves calibration accuracy under gradual changes, while restart-augmented BRPC further improves robustness and predictive performance under abrupt regime shifts compared to sliding-window Bayesian calibration and data assimilation baselines.
☆ The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity ICML 2026
Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers. Specifically, the channel-sparse down-projections trigger a dimension disparity of the first-token representation, necessitating the formation of attention sinks as a structural anchor. Then, we validate this causal chain through two controlled interventions: (i) isolating the aggregation effect via attention mask modifications and (ii) amplifying the variance of targeted token representations. Both interventions can replicate attention sinks at arbitrary positions. Our mechanistic understanding offers a foundation for the systematic control of sink formation. Finally, as a proof of concept, we propose \textit{head-wise RMSNorm}, an architectural modification that stabilizes value aggregation outputs during pre-training. Our experiments demonstrate that restoring statistical parity across positions significantly accelerates convergence.
comment: Accepted to ICML 2026
☆ SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://anonymous.4open.science/r/SoftSAE-8F71/.
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regression, with each layer exactly performing one step of normalized gradient descent on an in-context loss. Then, we show that our constructed transformer can be obtained through (i) training a single self-attention layer supervised by one-step gradient descent, and (ii) recurrently applying the trained layer to obtain a looped model. Training convergence guarantees of the self-attention layer and out-of-distribution generalization guarantees of the looped model are provided. Our results advance the theoretical understanding of ICL mechanism by showcasing how softmax transformers can effectively act as in-context learners.
comment: 94 pages, 8 figures
☆ DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments
Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sampler learns which covariates are most prognostic across successive batches; selected covariates then drive rerandomization and regression adjustment to reduce batch-level average treatment effect variance. Our primary theoretical contribution is a decoupling result: adaptive covariate selection based on past batches preserves batch-level randomization validity, and the cumulative inverse-variance weighted estimator achieves at least nominal asymptotic coverage. We further derive a Bayes risk bound for the acquisition layer that matches the minimax lower bound up to logarithmic factors. Empirically, DARTS systematically concentrates the budget on informative features, significantly closing the efficiency gap to oracle designs while maintaining strict inferential validity.
☆ How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce \emph{Dynamic Allocation via PRojected Optimization} (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.
☆ Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization
Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective--cross-entropy loss with $L^2$ regularization--by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-Δ\mathcal{F} + \tfrac{1}{s}\|\nabla\mathcal{F}\|^{2} \to \infty$ as $\|θ\| \to \infty$ for all $s>0$. From this structure, we derive explicit log-Sobolev and Poincaré constants $C_{\mathrm{LS}} \leq λ^{-1} + d/λ^{2}$, linking the regularization strength $λ$ and model dimension $d$ to finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds that tighten with increasing $λ$. To validate our theory, we introduce a scalable Villani diagnostic $Ψ_s(θ) = -Δ\mathcal{F} + s^{-1}\|\nabla \mathcal{F}\|^2$ and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of $Ψ_s$, spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.
comment: 17 pages, 10 figures
☆ UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
comment: 22 pages, 12 figures
☆ FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning
Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients' updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
comment: 39 pages, 4 figures, 21 tables (including appendix)
☆ Cross-Modal Navigation with Multi-Agent Reinforcement Learning
Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose \textbf{CRONA}, a Multi-Agent Reinforcement Learning (MARL) framework for \textbf{Cro}ss-Modal \textbf{Na}vigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
☆ ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting SIGGRAPH 2026
Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot's morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an approximate gradient for the upper-level loss. Our framework requires only a sparse set of semantic rigid-body correspondences and eliminates the need for manual tuning by identifying optimal values for a parameterization expressive enough to preserve characteristic motion across different embodiments. Moreover, by integrating retargeting directly with physics simulation, we produce physically plausible motions that facilitate robust imitation learning. We validate our method in simulation and on hardware, demonstrating challenging motions for morphologies that differ significantly from a human, including retargeting onto a quadruped.
comment: SIGGRAPH 2026
☆ DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study -- order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation -- fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
comment: 18 pages, 7 figures, 9 tables. Code will be made publicly available upon acceptance
☆ BRICKS: Compositional Neural Markov Kernels for Zero-Shot Radiation-Matter Simulation
We introduce a new strategy for compositional neural surrogates for radiation-matter interactions, a key task spanning domains from particle physics through nuclear and space engineering to medical physics. Exploiting the locality and the Markov nature of particle interactions, we create a \emph{next-particle prediction} kernel using hybrid discrete-continuous transformer models based on Riemannian Flow Matching on product manifolds. The model generates variable-sized typed sets of particles and radiation side effects that are the result of the interaction of an incident particle with a material volume. The resulting kernel can be composed to simulate unseen large-scale material distributions in a zero-shot manner. Unlike mechanistic simulators, our model is designed to be differentiable, provides tractable likelihoods for future downstream applications. A significant computational speed-up on GPU compared to CPU-bound mechanistic simulation is observed for single-kernel execution. We evaluate the model at the kernel level and demonstrate predictive stability over multi-round autoregressive rollouts. We additionally release a novel 20M-event radiation-matter interaction dataset for further research.
comment: 10 pages, 5 figures
☆ Towards Metric-Faithful Neural Graph Matching
Graph Edit Distance (GED) is a fundamental, albeit NP-hard, metric for structural graph similarity. Recent neural graph matching architectures approximate GED by first encoding graphs with a Graph Neural Network (GNN) and then applying either a graph-level regression head or a matching-based alignment module. Despite substantial architectural progress, the role of encoder geometry in neural GED estimation remains poorly understood. In this paper, we develop a theoretical framework that connects encoder geometry to GED estimation quality for two broad classes of neural GED estimators: graph similarity predictors and alignment-based methods. On fixed graph collections, where the doubly-stochastic metric $d_{\mathrm{DS}}$ is comparable to GED, we show that graph-level bi-Lipschitz encoders yield controlled GED surrogates and improved ranking stability; for matching-based estimators, node-level bi-Lipschitz geometry propagates to encoder-induced alignment costs and the resulting optimized alignment objective. We instantiate this perspective using FSW-GNN, a bi-Lipschitz WL-equivalent encoder, as a drop-in replacement in representative neural GED architectures. Across representative baselines and benchmark datasets, the resulting geometry-aware variants significantly improve GED prediction and ranking metrics. A faithfulness case study of untrained encoders, together with ablations and transfer experiments, supports the view that these gains arise from improved representation geometry, positioning encoder geometry as a useful design principle for neural graph matching.
☆ Distributionally-Robust Learning to Optimize
We propose a distributionally robust approach to learning hyperparameters for first-order methods in convex optimization. Given a dataset of problem instances, we minimize a Wasserstein distributionally robust version of the performance estimation problem (PEP) over algorithm parameters such as step sizes. Our framework unifies two extremes: as the robustness radius vanishes, we recover classical learning to optimize (L2O); as it grows, we recover worst-case optimal algorithm design via PEP. We solve the resulting problem with stochastic gradient descent, differentiating through the solution of an inner semidefinite program at each step. We prove high-probability bounds showing that the true risk of the learned algorithm is at most the in-sample L2O optimum plus a slack that shrinks with the sample size, and is no worse than the worst-case PEP bound. On unconstrained quadratic minimization, LASSO, and linear programming benchmarks, our learned algorithms achieve strong out-of-sample performance with certifiable robustness, outperforming both worst-case optimal and vanilla L2O baselines.
☆ PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
comment: 101 pages, 7 Figures, pre-print, Under Review
☆ On the Safety of Graph Representation Learning
Graph representation learning (GRL) has evolved from topology-only graph embeddings to task-specific supervised GNNs, and more recently to reusable representations and graph foundation models (GFMs). However, existing evaluations mainly measure clean transfer, adaptation, and task coverage. It remains unclear whether GRL methods stay reliable when deployment stresses affect graph signals, graph contexts, label support, structural groups, or predictive evidence. We introduce GRL-Safety, a multi-axis safety evaluation benchmark for GRL. GRL-Safety evaluates twelve representative methods, spanning topology-only embedding methods, supervised GNNs, self-supervised graph models, and GFMs, on twenty-five graph datasets under standardized evaluation conditions while preserving method-native adaptation. The evaluation covers five safety axes: corruption robustness, OOD generalization, class imbalance, fairness, and interpretation, with per-axis and sub-condition reporting rather than a single aggregate score. Our analysis yields three cross-axis insights that can inspire future research. First, safety behavior is shaped by the interaction between representation design and the stressed graph factor, rather than by method family alone. Second, foundation-era methods show axis-specific strengths rather than broad safety dominance. Third, several deployment regimes remain difficult even for the best evaluated method, revealing capability gaps that require new robustness, adaptation, or training objectives beyond model selection. The benchmark, evaluation protocols, and code are available at: https://github.com/GXG-CS/GRL-Safety.
comment: Preprint. 10 pages main text, appendices included
☆ Directional Consistency as a Complementary Optimization Signal: The GONO Framework
We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via consecutive gradient cosine similarity) while the loss remains high or decreases slowly. This observation reveals that existing optimizers such as Adam, SGD, and RMSprop lack explicit mechanisms to exploit temporal consistency in gradient directions, relying instead on magnitude-based signals that fail to distinguish plateaus, saddle points, and genuine convergence. Motivated by this, we introduce GONO (Gradient-Oriented Norm-Adaptive Optimizer), which adapts Adam's momentum coefficient beta_1 based on cc_t: amplifying momentum under directional consistency and suppressing it during oscillation. We prove GONO matches Adam's O(1/sqrt(T)) convergence rate and reduces exactly to Adam when the signal is uninformative. Empirically, cc_t achieves oscillation detection with F1=1.00 (vs. 0.45 for gradient norm), and GONO remains competitive with AdamW on MNIST (98.15%), CIFAR-10 (43.14%), and ResNet-18 (75.44%), establishing directional alignment as a theoretically grounded, practically actionable optimization signal. Code: https://github.com/victordaniel/gono-optimizer
☆ CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to centralized Intrusion Detection Systems (IDS), standard approaches struggle to generalize across diverse device behaviors and typically fail to utilize the vast amounts of unlabeled data present in realistic edge environments. To bridge these gaps, we propose CLAD, a holistic framework that seamlessly incorporates Clustered Federated Learning (CFL) with a novel Dual-Mode Micro-Architecture ($\text{DM}^2\text{A}$). This unified approach simultaneously tackles the two primary bottlenecks of IoT security: device heterogeneity and label scarcity. The $\text{DM}^2\text{A}$ component features a shared encoder followed by two branches, enabling joint unsupervised anomaly detection and supervised attack classification; this allows the framework to harvest intelligence from both labeled and unlabeled clients. Concurrently, the clustering component dynamically groups devices with congruent traffic patterns, preventing global model divergence. By carefully combining these elements, CLAD ensures that no data is discarded and distinct operational patterns are preserved. Extensive evaluations demonstrate that this integrated approach significantly outperforms state-of-the-art baselines, achieving a 30% relative improvement in detection performance in scenarios with 80% unlabeled clients, with only half the communication cost.
comment: 12 pages, 7 figures, 5 tables
☆ SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation
Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instances exactly but scales exponentially in state dimensions. Black-box reinforcement learning handles high-dimensional states but trains slowly and produces no sensitivities. We introduce SNAPO (Smooth Neural Adjoint Policy Optimization), a framework that embeds a neural policy inside a known, differentiable simulator, replaces hard constraints with smooth approximations, and computes exact gradients of the objective with respect to all policy parameters and all inputs in a single adjoint pass. We demonstrate SNAPO on three domains: natural gas storage (training in under a minute, 365 forward curve sensitivities at no additional cost per sensitivity), pension fund asset-liability management (6.5x-200x sensitivity speedup over bump-and-revalue, scaling with the number of risk factors), and pharmaceutical manufacturing (cross-unit sensitivities through a 4-unit process chain, with 20 ICH Q8 regulatory sensitivities from 5 adjoint passes in 74.5 milliseconds). All sensitivities are produced by the same backward pass that trains the policy, at a cost proportional to one reverse pass regardless of how many sensitivities are computed.
comment: 27 pages, 8 tables. Three domains: natural gas storage, pension fund ALM, pharmaceutical manufacturing. Benchmark code and trained policies available on request
☆ Dynamic Treatment on Networks
In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.
☆ Criticality and Saturation in Orthogonal Neural Networks
It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d.-case which are valid to all orders in $1/\mathrm{width}$. Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.
comment: 11 pages + Appendices
☆ Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data
Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes, including improved detection of rare classes. Random forest underperformed on minority subtypes despite strong overall accuracy, while SVM showed sensitivity to feature dimensionality. These findings highlight the importance of model simplicity, evaluation metrics, and feature selection in high-dimensional biological classification tasks.
comment: 8 pages, 4 figures, 3 tables. Independent research study using TCGA-BRCA RNA-seq data
☆ Optimal Counterfactual Search in Tree Ensembles: A Study Across Modeling and Solution Paradigms
Trust in counterfactual explanations depends critically on whether their recommended changes are truly minimal: suboptimal explanations may vastly overshoot the actual changes needed to alter a decision, and heuristic errors can affect individuals unevenly, giving some users relevant recourse while assigning others unnecessarily costly recommendations. Consequently, we study the problem of computing optimal counterfactual explanations for tree ensembles under plausibility and actionability constraints. This is a combinatorial problem: for a fixed model, counterfactual search boils down to selecting consistent branching decisions and threshold-defined regions under a distance objective. We exploit this structure through CPCF, a constraint programming (CP) formulation in which numerical features are encoded as interval domains induced by split thresholds, while discrete features retain native finite-domain representations. This yields a compact finite-domain formulation that supports multiple distance objectives without continuous split-boundary search. We then place CPCF in a broader comparison across mathematical programming paradigms: we extend a maximum Boolean satisfiability (MaxSAT) formulation, originally designed for hard-voting random forests, to soft-voting ensembles, and compare against the current state-of-the-art mixed-integer linear programming (MILP) optimal approach. Across ten datasets and three types of tree ensembles, we analyze scalability, anytime performance, and sensitivity to distance metrics. We observe that CP achieves the best overall performance. More importantly, our results identify regimes in which the specific strengths of each paradigm make it best suited: CP is most versatile overall, MaxSAT handles hard-voting ensembles particularly well, and MILP remains competitive in amortized inference settings with a moderate number of split levels.
☆ Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value-based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task-completion efficiency. We find that in commitment-constrained task allocation, performance under scale is shaped not only by nominal action-space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination-aware evaluation as a necessary complement to return-based benchmarking for cooperative MARL.
comment: 27 pages. Submitted and under review
☆ Diverse Sampling in Diffusion Models with Marginal Preserving Particle Guidance
We present EDDY (Exact-marginal Diversification via Divergence-free dYnamics), a guidance mechanism for diffusion and flow matching models that promotes diversity among samples generated while maintaining quality. EDDY exploits symmetries of the Fokker-Planck equation, using drift perturbations that change particle trajectories while preserving the evolving marginal distribution. We instantiate this principle through kernel-based anti-symmetric pairwise matrix fields, constructed from the repulsive directions. The resulting divergence-free dynamics promote diversity at the joint particle level while preserving each particle's marginal distribution without any additional training. As computing the guidance can be computationally expensive in cases such as text-to-image generation with perceptual embeddings, we propose practical approximations as an effective and efficient solution. Experiments on synthetic distributions and text-to-image generation show that EDDY improves diversity while maintaining strong distributional fidelity compared to common baselines.
comment: 9 pages, 4 figures
☆ Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning
The design of biological systems is hindered by uncertainty arising from both intrinsic stochasticity of biomolecular reactions and variability across laboratory or experimental conditions. In this work, we present a sequential framework to optimize genetic circuits under both forms of uncertainty. By employing simulator models based on differential equations or Markov jump processes alongside a reinforcement learning (RL) policy-based approach, our method suggests experiments that adapt to unknown laboratory conditions while accounting for inherent stochasticity. While previous Bayesian methods address uncertainty through iterative experiment-inference-optimization cycles, they typically require computationally expensive inference and optimization steps after each experimental round, leading to delays. To overcome this bottleneck, we propose an amortized approach trained up-front across a distribution of possible uncertain parameters. This strategy sidesteps the need for explicit parameter inference during the design cycle, enabling immediate, observation-based adaptation. We demonstrate our framework on models for heterologous gene expression and a repressilator circuit, showing that it efficiently handles both molecular noise and cross-laboratory variability.
☆ Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation
We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponentially weighted least-squares (EWLS) adaptation experts at multiple forgetting factors, and aggregates raw and EWLS-adapted forecasts with MLpol, a parameter-free online aggregation rule. Under boundedness conditions, we establish deterministic oracle inequalities showing that it competes with both the best raw predictor and the best bounded, time-varying affine combinations of the base predictions, up to a path-length-dependent tracking cost and a sublinear aggregation overhead. We evaluate MELO on French national electricity-load forecasting through the COVID-19 lockdown using no regime indicators, lockdown dates, or policy covariates. MELO reduces overall RMSE by 34.7\% relative to base-only MLpol and achieves lower overall RMSE than a TabICL reference supplied with an external COVID policy-response covariate. Moreover, MELO requires only lightweight per-step recursive updates without model retraining.
comment: Preprint
☆ Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability
Diffusion-based posterior samplers use pretrained diffusion priors to sample from measurement- or reward-conditioned posteriors, and are widely used for inverse problems. Yet their theoretical behavior remains poorly understood: even with exact prior scores, their outputs are biased, and in low-temperature regimes their discretizations can become unstable. We characterize this bias by introducing a tractable surrogate path connecting the true posterior to a standard Gaussian and comparing it to the sampler's path. Their density ratio satisfies a parabolic PDE whose reaction term measures the accumulated bias. A Feynman-Kac representation then expresses the Radon-Nikodym correction as an explicit path expectation, identifying which posterior regions are over- or under-sampled. We apply this framework to DPS and STSL, a related sampler. For DPS, the correction is an Ornstein-Uhlenbeck path expectation coupling the data conditional covariance with the reward curvature, revealing where DPS over- or under-samples. Next, we reinterpret STSL as an auxiliary drift that steers trajectories toward low-uncertainty regions, flattening the spatially varying part of the DPS reaction term. Finally, we characterize early guidance-stopping, a common mitigation for low-temperature instabilities caused by forward-Euler integration of the vector field. Together, these results clarify sampler bias, explain existing correctives, and guide stable variant designs.
☆ Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability. Hotel A cannot observe the competitor's remaining inventory, booking curve, or pricing rule, so the same Hotel A-visible state maps to multiple plausible Hotel B prices. Deterministic value-based RL and deterministic copying collapse this unresolved uncertainty into shortcut behavior. We introduce a trace-level diagnostic protocol using RevPAR, occupancy, ADR, full price-bucket distributions, L1/JS distances, and seed-level confidence intervals. The verified repair is Trace-Prior RL: learn a distributional market prior from lagged market traces, then train a stochastic pricing policy with a RevPAR reward and a KL penalty to the learned prior. The final policy matches Hotel B's RevPAR, occupancy, ADR, and price distribution within seed-level uncertainty, while still optimizing Hotel A's own reward. We argue that the contribution is not a new optimizer and not a hotel-pricing leaderboard, but a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces. A key finding is that higher exact action accuracy can worsen aggregate trace alignment when the target is distributional.
comment: 7 pages
☆ On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR don't maintain other model knowledge except mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in RLVR-trained model behaves like heavy-tailed distribution. (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.
☆ Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our position is that OOD for foundation models is a structurally distinct problem that cannot be solved within the prevailing model-centric paradigm, and that agentic systems constitute the missing paradigm required to address it. We defend this claim through four steps. First, we give a stage-aware formalization of OOD that accommodates partially observed multi-stage training distributions. Second, we prove a parameter coverage ceiling: there exist practically relevant inputs that no model-centric method (training-time or test-time) can handle within tolerance $\varepsilon$, for reasons intrinsic to parameter-based representation. Third, we characterize agentic OOD systems by four structural properties -- perception, strategy selection, external action, and closed-loop verification -- and show that they strictly extend the reachable set beyond the ceiling. Fourth, we respond to seven counterarguments, conceding two, and outline a research agenda. We do not claim that agentic methods subsume model-centric ones; we argue that the two are complementary, and that progress on FM-OOD requires explicit recognition of the agentic paradigm as a first-class research direction.
comment: 13 pages, 2 figures
☆ Optimizing Social Utility in Sequential Experiments
Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of `moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than $35$$\%$ relative to standard, non-sequential protocols.
☆ Efficient Techniques for Data Reconstruction, with Finite-Width Recovery Guarantees
Data reconstruction attacks on trained neural networks aim to recover the data on which the network has been trained and pose a significant threat to privacy, especially if the training dataset contains sensitive information. Here, we propose a unified optimization formulation of the data reconstruction problem based on initial and trained parameter values, incorporating state-of-the-art proposals. We show that in the random feature model, this formulation provably leads to training data reconstruction with high probability, provided the network width is sufficiently large; this unprecedented finite-width result uses PAC-style bounds. Furthermore, when the data lies in a low-dimensional subspace, we show that the network width requirement for successful reconstruction can be relaxed, with bounds depending on the subspace dimension rather than the ambient dimension. For general neural network models and unknown data orientations, we propose an efficient reconstruction algorithm that approximates the low-dimensional data subspace through the change in the first-layer weights during training and uses only the last-layer weights for reconstruction, thus reducing the search space dimension and the required network width for high-quality reconstructions. Our numerical experiments on synthetic datasets and CIFAR-10 confirm that our subspace-aware reconstruction approach outperforms standard full-space techniques.
☆ Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models ICML 2026
Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial depthwise redundancy across multiple models, suggesting iterative refinement with overlapping computations during inference stages. Guided by these insights, we design a proof-of-concept, looped single-layer model that uses only 20% of the original model's parameters while achieving comparable performance. The code is available at https://github.com/amirbalef/is_one_layer_enough.
comment: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
☆ MARBLE: Multi-Aspect Reward Balance for Diffusion RL
Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitates heavy manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually-tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97X the training speed of baseline training.
comment: Homepage and code repo: https://aim-uofa.github.io/MARBLE
☆ PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
☆ Cubit: Token Mixer with Kernel Ridge Regression
Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresponding values accordingly. Motivated by this perspective, we propose Cubit, a potential next-generation architecture that leverages Kernel Ridge Regression (KRR), while the vanilla Transformer relies on Nadaraya-Watson regression. Specifically, Cubit modifies the classical attention computation by incorporating the closed-form solution of KRR, combining value aggregation through kernel similarities with normalization via the inverse of the kernel matrix. To improve the training stability, we further propose the Limited-Range Rescale (LRR), which rescales the value layer within a controlled range. We argue that Cubit, as a KRR-based architecture, provides a stronger mathematical foundation than the vanilla Transformer, whose attention mechanism corresponds to Nadaraya-Watson regression. We validate this claim through comprehensive experiments. The experimental results suggest that Cubit may exhibit stronger long-sequence modeling capability. In particular, its performance gain over the Transformer appears to increase as the training sequence length grows.
comment: Tech Report
☆ Operator-Guided Invariance Learning for Continuous Reinforcement Learning
Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems with isomorphic value functions. We propose \textbf{VPSD-RL} (Value-Preserving Structure Discovery for Reinforcement Learning). It models continuous RL as a controlled diffusion with value-preserving mappings defined through Lie-group actions and associated pullback operators. We show that a value-preserving structure exists exactly when pulling back the value function and pushing forward actions commute with the controlled generator and reward functional. Further, approximate value-preserving structures with rigorous guarantees can be found when the Hamilton--Jacobi--Bellman mismatch is small. This framework discovers exact and approximate value-preserving structures by searching for the associated Lie group operators. VPSD-RL fits differentiable drift, diffusion, and reward models; learns infinitesimal generators via determining-equation residual minimization; exponentiates them with ODE flows to obtain finite transformations; and integrates them into continuous RL through transition augmentation and transformation-consistency regularization. We show that bounded generator/reward mismatch implies quantitative stability of the optimal value function along approximate orbits, with sensitivity governed by the effective horizon, and observe improved data efficiency and robustness on continuous-control benchmarks.
☆ Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts
In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistical inference methods often depend on strict identifying assumptions (such as surrogacy, covariate/label shift, or missingness assumptions). These assumptions can be difficult to validate and may be violated by various additional sources of distribution shift, potentially leading to biased parameter estimates and miscalibrated uncertainty quantification. We introduce an estimate-level framework, inspired by domain adaptation techniques, to empirically calibrate proxy-based inference. This framework models the proxy-primary metric discrepancy as a random effect at the parameter level, estimating its distribution from aggregated historical observations across past domains (e.g., experiments, time periods, or distinct segments). This method avoids the requirement for retaining individual-level response data. Additionally, this adjustment can be layered on top of existing proxy-correction methods (such as prediction-powered inference or importance weighting) to account for additional biases not addressed by those corrections. To manage uncertainty when the number of historical domains is limited, we provide both a method-of-moments estimator and a domain bootstrap procedure. We further validate this approach using publicly available datasets and real-world experiments.
comment: 10 pages, 5 figures
☆ Risk-Controlled Post-Processing of Decision Policies
Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where switching to the oracle fallback policy yields a large reduction in conditional violation risk. At the finite-sample level, given a fitted fallback policy and score, we develop a post-processing algorithm that uses calibration data to select a threshold. Leveraging tools from algorithmic stability and stochastic processes, we show that under regularity conditions, in the i.i.d. setting, the expected excess risk of the post-processed policy is $O(\log n/n)$. In the special case when an exact-safe fallback policy is available, the algorithm achieves precise expected risk control under exchangeability. In this setting, we also give high-probability near-optimality guarantees on the post-processed policy. Experiments on a COVID-19 radiograph diagnosis task, an LLM routing problem, and a synthetic multiclass decision task show that targeted post-processing can meet or nearly meet risk budgets while preserving substantially more agreement with the baseline than score-blind random mixing.
☆ Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^π$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
☆ Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents and thus induced cache reuse opportunities depend on the context of each task. To serve such dynamic workflows efficiently, we build a system dubbed PBKV (\textbf{P}rediction-\textbf{B}ased \textbf{KV}-Cache Management). For each workflow, PBKV predicts the agent invocations in several future steps by fusing the guidance from historical workflows and context of the target workflow. Based on the predictions, PBKV estimates the reuse potential of cache entries and keeps the high-potential entries in GPU memory. To be robust to prediction errors, PBKV utilizes the predictions conservatively during both cache eviction and prefetching. Experiments on three workflow benchmarks show that PBKV achieves up to $1.85\times$ speedup over LRU on dynamic workflows, and up to $1.26\times$ speedup over the SOTA baseline KVFlow on the static workflow.
☆ Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies
We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite-dimensional implementations, we show that global hitting-time error is bounded by one-step transition error amplified by the environment's transient spectral radius. Furthermore, we provide finite-sample guarantees accounting for approximation, statistical complexity, and trajectory-label mismatch. Derived from this theory, we curate Isomorphic Embedding Learning (IEL) as a new goal-agnostic foundation policy learning algorithm that anchors a HILP-style consistency objective with explicit hitting-time regression to ensure that the learned geometry reflects actual decision-time progress. This asymmetric and compositional structure enables robust graph-based multi-stage planning for long-horizon navigation. Our experiments demonstrate that IEL improves the state of the art of learning foundation policy policies from offline maze locomotion data. Our code can be found on https://github.com/MagnusBoock/IEL
☆ Dynamic Controlled Variables Based Dynamic Self-Optimizing Control
Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimization effects, translating the process optimization problem into a process control problem. Currently, self-optimizing control is widely applied to steady-state optimization problems. However, the development of process systems exhibits a trend towards refinement, highlighting the importance of optimizing dynamic processes such as batch processes and grade transitions. This paper formally introduces the self-optimizing control problem for dynamic optimization, termed the dynamic self-optimizing control problem, extending the original definition of self-optimizing control. A novel concept, "dynamic controlled variables" (DCVs), is proposed, and an implicit control policy is presented based on this concept. The paper theoretically analyzes the advantages and generality of DCVs compared to explicit control strategies and elucidates the relationship between DCVs and traditional controllers. Moreover, this paper puts forth a data-driven approach to designing self-optimizing DCVs, which considers DCV design as a mapping identification problem and employs deep neural networks to parameterize the variables. Three case studies validate the efficacy and superiority of DCVs in approximating multi-valued and discontinuous functions, as well as their application to dynamic optimization problems with non-fixed horizons, which traditional self-optimizing control methods are unable to address.
☆ No Triangulation Without Representation: Generalization in Topological Deep Learning
Despite an ever-increasing interest in topological deep learning models that target higher-order datasets, there is no consensus on how to evaluate such models. This is exacerbated by the fact that topological objects permit operations, such as structural refinements, that are not appropriate for graph data. In this work, we extend MANTRA, a benchmark dataset containing manifold triangulations, to a larger class of manifolds with more diverse homeomorphism types. We show that, unlike prior claims, both graph neural networks (GNNs) and higher-order message passing (HOMP) methods can saturate the benchmark. However, we find that this is contingent on the right representation and feature assignment, emphasizing their importance in baseline models. We thus provide a novel evaluation protocol based on representational diversity and triangulation refinement. Surprisingly, we find no indication that existing models are capable of generalizing beyond the combinatorial structure of the data. This points towards a research gap in developing models that understand topological structure independent of scale. Our work thus provides the necessary scaffolding to evaluate future models and enable the development of topology-aware inductive biases.
☆ Diversity Curves for Graph Representation Learning
Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in particular require interpretable, scalable, and reliable size-aware graph representations. Our work addresses these issues by tracking the structural diversity of a graph across coarsening levels. The resulting graph embeddings, which we denote diversity curves, are interpretable by construction, efficient, and directly comparable across coarsening hierarchies. Specifically, we track the spread of graphs, a novel isometry invariant that is inherently well-suited for encoding the metric diversity and geometry of graphs. We utilise edge contraction coarsening and prove that this improves expressivity, thus leading to more powerful graph-level representations than structural descriptors alone. Demonstrating their utility over a range of baseline methods in practice, we use diversity curves to (i) cluster and visualise simulated graphs across varying sizes, (ii) distinguish the geometry of single-cell graphs, (iii) compare the structure of molecular graph datasets, and (iv) characterise geometric shapes.
☆ Invariant-Based Diagnostics for Graph Benchmarks
Progress on graph foundation models is hindered by benchmark practices that conflate the contributions of node features and graph structure, making it hard to tell whether a model actually learns from connectivity, or whether it even needs to. We propose addressing this using graph invariants, i.e., permutation-invariant, task-agnostic structural descriptors that serve as a diagnostic framework for graph benchmarks. We show that (i) invariants are more expressive than standard GNNs, (ii) invariants characterize structural heterogeneity within and across benchmark datasets, (iii) invariants predict multi-task performance, and (iv) simple invariant-based models are competitive with, and sometimes exceed, transformer and message-passing baselines across 26 datasets. Our results suggest that expressivity is not the main driver of predictive performance, and that on tasks where structure matters, a non-trainable structural proxy often matches trained message-passing models. We thus posit that invariant baselines should become a standard for evaluating whether structure is required for a task and whether a model picks up on it, serving as a stepping stone towards graph foundation models.
☆ MINER: Mining Multimodal Internal Representation for Efficient Retrieval
Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single final-layer embedding. In this work, we first conduct a layerwise diagnostic on single-vector retrievers, revealing that retrieval-relevant signal resides in internal representations. Motivated by these findings, we propose MINER (Mining Multimodal Internal RepreseNtation for Efficient Retrieval), a lightweight plug-in module that probes and fuses internal signals across transformer layers into a single compact embedding without modifying the backbone or sacrificing single-vector efficiency. The first Retrieval-Aligned Layer Probing stage attaches a lightweight probe at each layer, surfacing which dimensions carry retrieval-relevant information. The subsequent Adaptive Sparse Multi-Layer Fusion stage applies performance-adaptive neuron-level masking to the selected layers and fuses the surviving signals into the final dense vector. Across ViDoRe V1/V2/V3, MINER outperforms existing dense single-vector retrievers on the majority of benchmarks, with up to 4.5% nDCG@5 improvement over its corresponding backbone. Compared to strong late-interaction baselines, in some settings MINER substantially narrows the nDCG@$5$ gap to $0.2$ while preserving the storage and serving advantages of dense retrieval.
comment: Preprint
☆ Invariant Features in Language Models: Geometric Characterization and Model Attribution
Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: (1) a geometric characterization of invariant latent features, (2) a contrastive subspace discovery method that separates semantic-changing from semantic-preserving variation, and (3) an application of invariant representations to zero-shot model attribution. Across models and layers, empirical results support these contributions. Invariant structure emerges in specific depth regions, semantic displacement lies largely outside the nuisance subspace, and representation-level interventions indicate a causal role of invariant components in model outputs. Invariant representations also capture model-specific geometric patterns, enabling accurate attribution. These findings suggest that semantic invariance can be viewed as a local geometric property of latent representations, offering a principled perspective on how language models organize meaning.
☆ ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization
Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the surrogate model and acquisition target are correctly specified, finite-sample Monte Carlo error can perturb acquisition values. This can, in turn, flip candidate rankings and lead to suboptimal BO decisions. As a remedy, we aim at variance reduction and propose an orthogonal acquisition estimator that subtracts an optimally weighted score-function control variate, which yields an acquisition residual orthogonal to posterior score directions and which thus reduces Monte Carlo variance. We further introduce OrthoBO: a Bayesian optimization framework that combines our orthogonal acquisition estimator with ensemble surrogates and an outer log transformation. We show theoretically that our estimator preserves the target, leads to variance reduction, and improves pairwise ranking stability. We further verify the theoretical properties of OrthoBO through numerical experiments where our framework reduces acquisition estimation variance, stabilizes candidate rankings, and achieves strong performance. We also demonstrate the downstream utility of OrthoBO in hyperparameter optimization for neural network training and fine-tuning.
☆ Scene-Adaptive Continual Learning for CSI-based Human Activity Recognition with Mixture of Experts IEEE
Channel state information (CSI)-based human activity recognition (HAR) is vulnerable to performance degradation under domain shifts across varying physical environments. Continual learning (CL) offers a principled way to learn new domains sequentially while preserving past knowledge, but existing CL solutions for CSI-based HAR scale poorly with accumulating domains, rely on a large replay buffer, or incur linearly growing inference cost. In this letter, we propose Scene-Adaptive Mixture of Experts with Clustered Specialists (SAMoE-C), which formulates cross-domain CSI-based HAR as a mixture-of-experts system that enables scene-specific adaptation, via an attention-based semantic router that activates only selected experts for each input. Moreover, we develop a novel training protocol, which requires only a tiny replay buffer for stabilizing domain discrimination of the router. Experimental results on a four-scene CSI dataset demonstrate that SAMoE-C approaches the state-of-the-art accuracy, while maintaining a significantly lower inference cost. By jointly combining modular experts, selective activation with router and a lightweight training protocol, SAMoE-C enables scalable cross-domain CSI-based HAR deployment with low training overhead and high computational efficiency in real-world settings.
comment: 5 pages, 3 figures, 3 tables, this article was submitted to IEEE for possible publication
☆ FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing
Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level regularization or update-correction mechanisms. Recent studies, however, suggest that Transformer-based architectures may be inherently more robust than conventional models under heterogeneous federated training. Motivated by this observation, we investigate how different parameter components within the attention mechanism influence federated optimization. Specifically, we decompose the attention module into a query/key block, which determines the attention kernel, and a value block, which performs semantic transformation under the induced kernel. Based on this perspective, we propose FedFrozen, a two-stage federated optimization framework that first performs full-model warm-up training and then freezes the query/key block while continuing to optimize the value block. Under a linear-attention formulation, we show that the warm-up stage can be interpreted as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to a restricted value-block optimization problem under a fixed attention kernel. Our analysis further reveals an explicit trade-off that governs the choice of warm-up length. Simulations validate the predicted bias-drift behavior, and real-data experiments demonstrate that FedFrozen improves both the stability and effectiveness of Transformer models in heterogeneous federated learning.
comment: 25 pages
☆ Hyperbolic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.
comment: 24 pages, 14 figures
☆ Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management
Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neural-actuarial framework that combines Hierarchical LSTM networks with a Mean-Bias Correction (MBC) anchoring mechanism. Positioned as a governance-friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out-of-sample validation (2012-2020): it outperforms Li-Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near-linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP-based cross-country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.
comment: 26 pages, 12 figures. Code available at https://github.com/davide-rindori/Actuarial-DS-Portfolio/tree/main/04_Multi_Population_Longevity_XAI
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
☆ Federated Cross-Client Subgraph Pattern Detection
Subgraph pattern detection aims to uncover complex interaction structures in graphs. However, state-of-the-art graph neural network (GNN)-based solutions assume centralized access to the entire graph. When graphs are instead distributed across multiple parties, client-local GNN computations diverge from those of a centralized model, resulting in a representation-equivalence gap. We formalize this as a structural observability problem, where subgraph patterns crossing partition boundaries become locally unidentifiable. To bridge this gap, we propose a per-step, layer-wise embedding exchange framework in which clients synchronize intermediate node representations at each layer of the forward pass, without exposing raw features or labels. Under an extended-subgraph assumption and shared model parameters across clients, this framework recovers the same node representations as a centralized GNN over the full graph. Experiments on synthetic directed multigraphs with cycles, bicliques, and scatter-gather patterns show that embedding exchange and federated parameter aggregation are complementary rather than interchangeable: their combination recovers most of the representation gap, provided exchanged embeddings are fresh per-step rather than stale per-epoch.
☆ FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generation into low- and high-frequency components, assigns them separate transport paths, predicts them with a factorized network, and trains them with a frequency-aware objective. In this way, coarse-to-fine generation becomes an explicit design principle rather than an implicit behavior. On ImageNet class-to-image generation, FREPix achieves competitive results among pixel-space generation models, reaching 1.91 FID at $256\times256$ and 2.38 FID at $512\times512$, with particularly strong behavior in the low-NFE regime.
☆ E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
comment: 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework
☆ Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors
Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
☆ FRInGe: Distribution-Space Integrated Gradients with Fisher--Rao Geometry
Gradient-based attribution methods are model-faithful and scalable, but Integrated Gradients (IG) can be brittle because explanations depend on heuristic baselines, straight-line paths, discretization, and saturation. We propose Fisher--Rao Integrated Gradients (FRInGe), which defines both the reference and interpolation schedule in predictive distribution space. FRInGe replaces input baselines with a maximum-entropy predictive reference and follows a Fisher-Rao geodesic on the probability simplex. The corresponding input-space trajectory is realized through the pullback Fisher metric and stabilized by KL and Euclidean trust regions; attributions are obtained by integrating input gradients along this trajectory. Across six ImageNet architectures, FRInGe most clearly improves calibration-oriented attribution metrics, especially MAS scores, while remaining competitive on perturbation AUC and infidelity.
☆ SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only $\textbf{5B}$ retraining tokens, surpassing the dense model's 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using $\textbf{40B}$ tokens. Such improvements on the accuracy-efficiency trade-off from SparseForge are shown to be consistent across model families.
☆ Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves
Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed as \textit{HilbNets}. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin \& Niyogi \cite{BELKIN20081289} convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.
comment: 51 pages, 3 figures, 5 tables
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent space as stronger foundation for policy-relevant robotics diffusion world models.
comment: 9 pages
☆ Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient.We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.
☆ Covariate Balancing and Riesz Regression Should Be Guided by the Neyman Orthogonal Score in Debiased Machine Learning
This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error generally contains treatment-specific components because the outcome regression is a function of the full regressor $X=(D,Z)$. In that case, balancing common functions of $Z$ can leave the treatment-specific component unbalanced. We therefore advocate regressor balancing, implemented by Riesz regression with basis functions of $X$, as the general balancing principle for DML. The position is not that covariate balancing is invalid, but that covariate balancing should be understood as the special case that is appropriate when the score-relevant regression error is a function of covariates alone.
☆ Data-Driven Covariate Selection for Nonparametric and Cycle-Agnostic Causal Effect Estimation
Estimating causal effects from observational data requires identifying valid adjustment sets. This task is especially challenging in realistic settings where latent confounding and feedback loops are present. Existing approaches typically assume acyclicity or rely on global causal structure learning, limiting applicability and computational efficiency. In this work, we study a local, data-driven method for covariate selection based on conditional independence information. While this method is known to be sound and complete in acyclic causal models, its validity in the presence of cycles has remained unclear. Our main contribution is to show that these guarantees extend to cyclic causal models. In particular, our result relies on the invariance of conditional independence assertions under $σ$-acyclification. These findings establish a unified, cycle-agnostic perspective on covariate selection and causal effect estimation, showing that the method applies across cyclic and acyclic settings without modification. Empirically, we validate this on extensive synthetic data, showing reliable performance in cyclic causal models.
☆ MinMax Recurrent Neural Cascades
We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivity for a finite-memory system. Second, they can be evaluated in parallel with a runtime that is logarithmic in the input length given enough processors; and they can also be evaluated sequentially. Third, their state and activations are bounded uniformly for all input lengths. Fourth, at almost all points, their loss gradient exists and it is bounded. Fifth, they do not exhibit a vanishing state gradient: the gradient of a state w.r.t. a past state can have constant value one regardless of the time distance between the two states. Finally, we find empirical evidence that the favourable theoretical properties of MinMax RNCs are matched by their practical capabilities: they are able to perfectly solve a number of synthetic tasks, showing superior performance compared to the considered state-of-the-art recurrent neural networks; also, we train a MinMax RNC of 127M parameters on next-token prediction, and the obtained model shows competitive performance for its size, providing evidence of the potential of MinMax RNCs on real-world tasks.
☆ Empirical Evidence for Simply Connected Decision Regions in Image Classifiers
Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops inside a decision region can be contracted without leaving that region. To this end, we propose an iterative quad-mesh filling procedure that constructs a finite-resolution label-preserving surface bounded by a given loop and lying entirely within the same decision region. We further connect this construction to natural Coons patches in order to quantify its deviation from a canonical geometric interpolation of the loop. By evaluating our method across several modern image-classification models, we provide empirical evidence supporting the hypothesis that decision regions in deep neural networks are not only path connected, but also simply connected.
☆ Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics
We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.
☆ A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of standard GRPO's gradient, explaining its empirical stability despite discarding continuous reward magnitudes. Building on this foundation, we propose Hard-Pair-GRPO, an advanced variant introducing explicit local probability constraints and constrained KL-fitting optimization to further suppress gradient noise and global policy drift. We provide comprehensive theoretical guarantees for both variants--including monotonic policy improvement, deterministic gradient direction, gradient-variance reduction, and dynamic step-size convergence. Extensive experiments on standard LLM alignment benchmarks (HH-RLHF,UltraFeedback) and the MuJoCo continuous control task HalfCheetah-v4 demonstrate that our Pair-GRPO family consistently outperforms state-of-the-art baselines in alignment quality, human preference win rate, training stability, and generalization to general reinforcement learning. Ablation studies validate the critical contributions of each core component.
☆ Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing
Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $τ$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $τ$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $τ$-mixing data. Moreover, we derive the sample complexity of DQN under $tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.
comment: 48 pages total. 6 figures; 3 tables
☆ eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a classifier's latent representations during training. eX2L achieves this by penalizing the similarity between Grad-CAM activation maps generated by a primary label classifier and those from a concurrently trained confounder classifier. On the rigorous Spawrious Many-to-Many Hard Challenge benchmark, eX2L achieves an average accuracy (AA) of 82.24% +/- 3.87% and a worst-group accuracy (WGA) of 66.31% +/- 8.73%, outperforming the current state-of-the-art (SOTA) by 5.49% and 10.90%, respectively. Beyond its competitive performance, eX2L demonstrates that functional domain invariance can be achieved by explicitly decoupling label and nuisance attributes at the group level.
☆ The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models
Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models-and potentially exacerbate disparities-remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order-consistently favoring higher-variance classes-while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion MNIST.
☆ Layer Collapse in Diffusion Language Models NeurIPS
Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant outlier becomes an indispensable information carrier while remaining representations collapse into redundant structure. These findings have strong practical implications, verified through controlled pre-training experiments. DLMs are surprisingly robust to compression: LLaDA under 3-bit GPTQ quantization drops only -1.8% on GSM8K, whereas Llama-3.1-8B drops -64.7%. Optimal sparsity allocation also reverses between families: at 50% average sparsity, allocating more to early layers in LLaDA yields +8.4% over the reverse strategy, while the same allocation costs Llama -8.4%. Our findings reveal that the DLM training objective fundamentally reshapes layer dynamics relative to AR models, with direct consequences for compression and deployment. Code: github.com/Conzel/super-outlier-dlm.
comment: 9 Pages, Under Review at NeurIPS
☆ Flow Matching with Arbitrary Auxiliary Paths
We introduce a new generative modeling framework, \textbf{Flow Matching with Arbitrary Auxiliary Paths (AuxPath-FM)}, which generalizes conditional flow matching by incorporating an auxiliary variable drawn from an arbitrary distribution into the probability path. Unlike prior methods that restrict auxiliary components to Gaussian noise, AuxPath-FM allows the variable $η$ to follow any distribution, producing trajectories of the form $X_t = a(t)X_1 + b(t)X_0 + c(t)η$. We theoretically demonstrate that this construction preserves the continuity equation and maintains a training objective consistent with the marginal formulation. This flexibility enables the design of diverse probability paths using various priors, including Gaussian, Uniform, Laplace, and discrete Rademacher distributions, each offering unique geometric properties for generative flows. Furthermore, our framework allows for specialized tasks such as label-guided generation by encoding structured semantic information into the auxiliary distribution. Overall, AuxPath-FM provides a principled and general foundation for probability path design, offering both theoretical generality and practical flexibility for diverse generative modeling tasks.
☆ Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction
This paper presents a preliminary analysis of the ability of Chronos foundation model to process and internally represent frequency domain information. Foundation models that process time-series data offer practitioners a unified architecture capable of learning generic temporal representations across diverse tasks and domains, reducing the need for task-specific feature engineering and enabling transfer across signal modalities. Despite their growing adoption, the extent to which such models encode fundamental signal properties remains insufficiently characterised. We address this gap by analysing Chronos under controlled conditions, starting from the simplest class of signals: discrete sinusoids generated at fixed frequencies. Using lightweight online minimum description length probes applied to the decoder architecture, we test for the presence and separability of frequency information in the model's internal representations. The results provide insight into how frequential content is captured across the frequency spectrum and highlight regimes in which representation quality may degrade or require particular care. These findings offer practical guidance for users of Chronos in signal processing and information fusion contexts, and contribute to ongoing efforts to improve the interpretability and evaluation of foundation models for temporal data.
☆ Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations
This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectories practical by trading additional recomputation for substantially lower memory usage. This enables full-gradient adaptive attacks against diffusion- and Langevin-based purification defenses, where prior evaluations often resort to approximate backpropagation due to memory constraints. These approximations can weaken the attack signal and risk overestimating robustness. In parallel, stochasticity in iterative purification is frequently under-controlled, even though different purification trajectories can substantially change reported robustness metrics. Building on this insight, we introduce a memory-efficient full-gradient evaluation framework for stochastic purification defenses. The framework combines checkpointed backpropagation with evaluation protocols that control stochastic variability, thereby reducing memory bottlenecks while preserving exact gradients. We evaluate diffusion-based purification and Langevin sampling with Energy-Based Models (EBMs), demonstrating that full-gradient attacks uncover vulnerabilities missed by approximate-gradient evaluations. Our framework yields stronger state-of-the-art $\ell_{\infty}$ and $\ell_{2}$ white-box attacks and further supports probing out-of-distribution robustness. Overall, our results show that exact-gradient evaluation is essential for reliable benchmarking of iterative stochastic defenses.
☆ Order-Agnostic Autoregressive Modelling with Missing Data
Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.
☆ Topological Signatures of Grokking
We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.
comment: 19 pages, 14 figures, 2 tables
☆ Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades
Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of the cost-quality frontier on decreasing-benefit regions of the confidence support, with reciprocal shadow prices linking the budget- and quality-constrained formulations. Given a pool of $k$ models, we characterize the frontier achievable by deterministic two-model threshold cascades as the pointwise envelope over $\binom{k}{2}$ pairwise cascades, with switching points where the optimal pair changes. For $k$-model cascades, we derive first-order conditions in which a single shadow price equalizes marginal quality-per-cost across stage boundaries. We validate the framework on five benchmarks (MATH, MMLU, TriviaQA, SimpleQA, LiveCodeBench) across eight models from five providers. Within the deterministic threshold-cascade class, full fixed chains underperform the pairwise envelope, and optimized subsequence cascades do not deliver practically meaningful held-out gains over it. A lightweight pre-generation router exceeds the best cascade policy on four of five datasets, mainly because it avoids the cheap model's generation cost on queries sent directly to a larger model rather than because of a stronger routing signal. These results suggest that cascade performance is limited primarily by structural cost, since cascades pay the cheap model before any escalation decision, rather than by a shortage of intermediate stages.
☆ A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring
Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions. We formalize continuous auditing as a $T$-round Stackelberg game between an auditor that commits to a temporal policy and an adaptive auditee, and identify a structural feature of any noise-aware static-auditor design: a cover regime in which coverage gaps and granularity gaps cannot be closed simultaneously. We make this formal as Observation 1 and show that two minimal extension policies, each derived from the observation, close the regime along orthogonal axes: a sample-size-aware static rule (Periodic-with-floor) closes the granularity-failure case, while a history-conditioned suspicion-escalation policy closes the coverage-failure case for the naive Drift strategy -- and neither closes both, exactly as the observation predicts; an audit-aware OffAuditDrift strategy that exploits Stackelberg commitment defeats both. To support empirical study we contribute a non-additive harm decomposition (welfare loss $W$, coverage loss $C$) that exposes how attrition shifts harm from the regulator-accountable surface to a regulator-invisible one; an initial library of five auditee strategies (Delay, Drift, Cherry-pick, Attrition, OffAuditDrift) and five auditor policies, calibrated to summary statistics from published audits of the DSA Transparency Database; and a reproducible simulator with a small, extensible Python interface.
☆ Eliciting associations between clinical variables from LLMs via comparison questions across populations
The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between patient characteristics, as a key building block for medical decision making. To avoid the pitfalls of direct elicitation, we propose an approach based on structured comparison questions, specifically patient comparison triplet questions. This is combined with a statistical model for the LLM representation that provides estimates of correlations without access to activations or model internals. Intuitively, we consider how similarity decisions of LLMs based on a first variable are affected by providing information on a second variable for one of the patients being assessed. We then induce prompt-level environment shifts to obtain correlation estimates for different subpopulations, which enables an invariant causal prediction (ICP) approach to obtain conservative candidate parent links. We demonstrate the method in two clinical domains, chronic obstructive pulmonary disease (COPD) and multiple sclerosis (MS). Across prompted environments, the elicited correlations are smooth, stable, and clinically interpretable, yet vary in a statistically significant way that supports downstream invariance testing, such that ICP provides a small set of candidate invariant parent links. These results show that indirect elicitation via triplet comparisons can recover meaningful association structure from LLMs and offer a cautious route from implicit correlations to causal statements that are congruent with LLM answering patterns.
☆ MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. %This yields benchmarks that are formally validated. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
☆ TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices
Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and the Jacobi prior; a Bayesian method that provides a closed form non-iterative estimators via projection, for the classification. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence and consistency, asymptotic normality and the bias correction of Jacobi-DMR. All data and codes are available here: https://github.com/shouvik-sardar/TinyBayes
comment: 14 Pages, 1 Figure, 4 Tables
☆ LINC: Decoupling Local Consequence Scoring from Hidden Matching in Constructive Neural Routing
Constructive neural routing solvers usually score the next action by matching a decoder context to candidate embeddings, hiding deterministic one-step consequences such as travel, waiting, slack, and capacity changes. We propose LINC (Local Inference via Normed Comparison), a decoder-side candidate decision architecture that computes these consequences explicitly. LINC uses them according to their decision role: centered relative consequences are compared by a shared linear local scorer, while feasible-set summaries modulate the decoder context. This preserves standard global matching and relieves the hidden state from rediscovering transition arithmetic. The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) serves as the main constrained-routing stress test; the same interface extends to the Capacitated Vehicle Routing Problem (CVRP) and Traveling Salesman Problem (TSP). In particular, for CVRPTW, LINC reduces PolyNet's Solomon/Homberger gaps from 13.83\%/38.15\% to 7.26\%/14.71\%; for TSP and CVRP, it also improves external-benchmark gaps.
comment: 21 pages, 10 figures, 10 tables. Code: https://github.com/Elaina10172004/LINC
☆ Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase variation, benchmark familiarity, and judge framing-sensitivity. Across five instruction-tuned checkpoints from four open-weight families plus a matched OLMo-3 base/instruct ablation ($20$ paired items, $840$ generations per checkpoint), we find striking heterogeneity. OLMo-3-Instruct alone is eval-cautious -- evaluation framing raises refusal vs. neutral by $11.8$pp ($p=0.007$) and reduces harmful compliance vs. deployment by $3.6$pp ($p=0.024$, $0/20$ items inverted) -- while Mistral-Small-3.2, Phi-3.5-mini, and Llama-3.1-8B are deployment-cautious}, with marginal eval-vs-deployment refusal effects of $-9$ to $-20$pp. The matched OLMo-3 base also exhibits the deployment-cautious pattern, identifying alignment as the inversion stage; within Llama-3.1, the $70$B model preserves direction with attenuated magnitude, ruling out a simple ``small-model effect that reverses at scale.'' One caveat: the cross-family heterogeneity is judge-dependent. Re-judging with a different-family safety classifier (Llama-Guard-3-8B) preserves the within-OLMo eval-cautious direction but flattens the cross-family contrast, indicating that the two judges operationalize distinct constructs.
☆ Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hatα) M_{\mathrm{Env}(m)}(x) + \barη$, holds for every platform strategy, with $\barη$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
☆ SMolLM: Small Language Models Learn Small Molecular Grammar
Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as shown by error classification, linear probing, and sparse autoencoders. A systematic ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.
comment: 18 pages, 5 figures, 10 tables
☆ Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's Kronecker preconditioners: their eigenvalue spectra exhibit a \emph{spike-and-flat} shape -- a few dominant eigenvalues followed by an approximately uniform tail -- across layers and training stages, holding exactly under a rank-$ρ$ signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo's Kronecker factors to a parametric family aligned with the spike-and-flat shape: full spectral structure on a tracked $r$-dimensional subspace, single shared eigenvalue across the remaining $n-r$ directions. On these directions, we apply orthogonalization. An identity shows that this orthogonalization recovers the algebraic form of full KL-Shampoo's preconditioner. On four pre-training scales (GPT-2 124M / 350M, LLaMA 134M / 450M), Pro-KLShampoo consistently outperforms KL-Shampoo at every subspace rank we test in validation loss, peak per-GPU memory, and wallclock time to reach each loss level.
☆ End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems
Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limitations of this setting. First, we establish identifiability of a broad class of recurrent nonlinear switching dynamical systems under flexible assumptions, significantly extending prior results. Second, we introduce $Ω$SDS, a flow-based estimator that enables exact likelihood optimization using expectation-maximisation. Through empirical validation on both synthetic and real-world data, our results demonstrate that $Ω$SDS achieves improved disentanglement compared to VAE-based estimators and more accurate forecasting of underlying dynamics.
☆ When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias
Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth $\ell_1$ interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate $Θ(σ^2/\log(p/n))$ for noise variance $σ^2$. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with $k^*$ head eigenvalues and $r_2 = p - k^*$ tail dimensions, the risk converges to zero when $r_{2} \gg n$, but only at a logarithmic rate $Θ(σ^2/\log(r_2/n))$, which is slower than the linear decay observed in $\ell_2$ geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded $\ell_1$-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for $\ell_1$-bounded signals.
☆ Perceive, Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting
Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal. Current deep forecasting models, despite their scale and complexity, rely on fixed weight matrices applied uniformly to all temporal tokens. This creates a static pattern response: models settle into a compromised average, unable to adapt to changing local dynamics. We introduce Dynamic Pattern Recalibration (DPR), a backbone-agnostic mechanism that resolves this via token-level recalibration. Through a lightweight "Perceive-Route-Modulate" pipeline, DPR computes a soft-routing distribution over a learned basis of adaptive response patterns, generating a time-aware modulation vector that recalibrates hidden states via a residual Hadamard product. As a backbone-agnostic adapter, DPR enhances forecasting across diverse architectures with minimal overhead, confirming it addresses a general bottleneck. As a minimalist standalone model, DPRNet achieves competitive performance across 12 benchmarks, validating dynamic recalibration against macroscopic parameter scaling.
comment: 22 pages, 6 figures. Preprint
☆ Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces
Molecular generative models often assume meaningful latent geometry, but apparent property predictability can reflect sequence-level shortcuts rather than chemical organization. We study this issue in an unsupervised autoregressive Transformer-VAE trained on SELFIES. After training, we freeze the model, fit linear probes to RDKit descriptors, and use the probe weights as candidate global steering directions. To separate chemical signal from SELFIES artifacts, we introduce a confound-aware evaluation based on residualization, confound-direction alignment analysis, and decoded-molecule traversal. This is necessary because SELFIES length, branch tokens, ring tokens, and token entropy are strongly encoded in the latent space. Under this confound-aware evaluation, we find robust monotonic steering for cLogP, FractionCSP3, HeavyAtomCount, TPSA, BertzCT, and HBA. Nonlinear probes further show that some properties admit stable global directions, while others are better described by local latent gradients. Overall, our results show that chemically meaningful steering can emerge in entangled molecular latent spaces, but only when validated through decoded molecules and controlled for representation-level confounds.
☆ Region Seeding via Pre-Activation Regularization: A Geometric View from Piecewise Affine Nerual Networks
Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlinear target functions. In practice, standard training realizes far fewer region refinements in data-visited neighborhoods than the architecture could in principle support, while existing region-count theory is primarily architectural and offers little guidance on how optimization shapes the realized partition near the data. Our theory provides a sufficient condition under which bringing neuron switching surfaces sufficiently close to data points ensures their intersection with local neighborhoods, which in turn implies a strict increase in the local affine-region count, yielding a principled training-time handle for seeding data-relevant partitions early in optimization. Guided by these results, we propose a plug-and-play region-seeding regularizer that encourages early partitioning while allowing task-driven refinement to dominate later in training. Experiments show that the regularizer increases the number of realized affine regions via exact enumeration and improves overall performance on toy datasets, while also improving early-stage accuracy and achieving comparable (or slightly improved) final accuracy on ImageNet-1k for classical models.
☆ Attributions All the Way Down? The Metagame of Interpretability
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $φ(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
☆ Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text
The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likelihood-based token scores across regions with fundamentally different statistical structure, as most detectors do, causes a form of Simpson's paradox: a strong local signal is destroyed by inappropriate aggregation. To correct for this, we introduce a learned local calibration step grounded in Bayesian decision theory. Rather than aggregating raw token scores, we first learn lightweight predictors of the score distributions conditioned on position in hidden space, and aggregate calibrated log-likelihood ratios instead. This single intervention dramatically and consistently improves detection performance across all baseline detectors and all datasets we consider. For example, our calibrated variant of Fast-DetectGPT improves AUROC from $0.63$ to $0.85$ on GPT-5.4 text, and a locally-calibrated DMAP detector we introduce achieves state-of-the-art performance across the board. That said, our central contribution is not a new detector, but a precise diagnosis of a significant cause of under-performance of existing detectors and a principled, modular remedy compatible with any token-averaging pipeline. This will serve as a foundation for the community to build upon, with natural avenues including richer distributional models, improved calibration strategies, and principled ensembling with hidden-space geometry signals via the full Bayes-optimal decision rule.
comment: 10 pages, 3 figures, 2 tables, 11 appendices
☆ Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance
When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential to leverage complementary modalities. In this article, we propose a multimodal deep generative model for semi-supervised learning under class imbalance. Our approach uses separate encoders for each modality, sharing latent variables across modalities, and simplifies joint posterior computation with a product-of-experts method. To further address class imbalance, we replace typical Gaussian distributions with Student's t-distributions for the prior, encoder, and decoder, better capturing the heavy-tailed latent distributions in imbalanced data. We derive a new objective function for training the proposed model on both labeled and unlabeled data using $γ$-power divergence. Empirical results on benchmark and real-world datasets demonstrate that our model outperforms baseline methods in generalization, achieving superior classification performance for partially labeled multimodal data with imbalanced class distributions.
☆ LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.
☆ INEUS: Iterative Neural Solver for High-Dimensional PIDEs
In this paper, we introduce INEUS, a meshfree iterative neural solver for partial integro-differential equations (PIDEs). The method replaces the explicit evaluation of nonlocal jump integrals with single-jump sampling and reformulates PIDE solving as a sequence of recursive regression problems. Like Physics-Informed Neural Networks (PINNs), INEUS learns global solutions over the entire space-time domain, yet it offers a more efficient treatment of nonlocal terms and avoids the computationally expensive differentiation of full PIDE residuals. These features make INEUS particularly well suited for high-dimensional PDEs and PIDEs. Supported by a contraction-based convergence proof for linear PIDEs, our numerical experiments show that INEUS delivers accurate and scalable solutions for various high-dimensional linear and nonlinear examples.
☆ PACE: Prune-And-Compress Ensemble Models
Ensemble models achieve state-of-the-art performance on prediction tasks, but usually require aggregating a large number of weak learners. This can hinder deployment, interpretability, and downstream tasks such as robustness verification. Remedies to this issue fall into two main camps: pruning, which discards redundant learners, and compression, which generates new ones from scratch. We introduce PACE, a framework that interleaves these paradigms in a two-phase strategy. First, new learners are actively generated via a theoretically grounded procedure to enhance the diversity of the initial ensemble. When no more relevant learners can be found, a second phase of pruning is performed on this enriched ensemble. During both operations, PACE allows fine control on the faithfulness to the original ensemble. Experiments show that our method outperforms prior pruning and compression methods while offering principled control of faithfulness guarantees.
☆ When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy
Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture--dataset pairs, with a mean gain of 4.66\%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18\% over the next best baseline.
☆ A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
While generative modeling has achieved remarkable success on tasks like natural language-conditioned image generation, enabling model adaptation from example data points remains a relatively underexplored and challenging problem. To this end, we propose Function Projection for Flow Matching (FP-FM), an algorithm that directly conditions generation on samples from the target distribution. FP-FM learns basis functions to span the velocity fields corresponding to a set of training distributions, and adapts to new distributions by computing a simple least-squares projection onto this basis. This enables efficient generation of samples from diverse target distributions without additional training at inference time. We further introduce multiple variants of FP-FM that provide a trade-off in expressivity and compute by enriching the coefficient calculation, e.g., by making the coefficients dependent on time. FP-FM achieves greatly improved precision and recall relative to baselines across synthetic and image-based datasets, with especially strong gains on unseen distributions.
☆ ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees
Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of \textbf{con}volution-smoothed \textbf{qu}antil\textbf{e} \textbf{R}eLU neural \textbf{net}works, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guarantees over Besov function classes. In numerical studies, we demonstrate that the proposed approach outperforms standard quantile neural networks at multiple quantile levels, showing improved estimation accuracy and training efficiency across the board, with particularly pronounced advantages at high and low quantiles.
☆ Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
End-to-end autonomous driving models generate future trajectories from multi-view inputs, improving system integration but introducing opaque decisions and hard-to-localize risks. Existing methods either rely on auxiliary monitoring models or generate textual explanations, but are decoupled from the planning process and fail to reveal the visual evidence underlying trajectory generation. While attribution offers a direct alternative, planning differs from image classification by taking six-view camera images as input and predicting continuous multi-step trajectories, requiring attribution to capture both critical views and regions and their influence on outputs. Moreover, whether attribution maps can support risk identification remains underexplored. To address this, we propose a hierarchical attribution framework for end-to-end planning. Specifically, using L2 consistency with the original trajectory as the objective, we design a coarse-to-fine region attribution strategy that searches candidate regions across the full six-view input and refines attribution within them. We further extract three attribution statistics as predictive signals for planning risk, including attribution entropy to measure how concentrated the planner's reliance is over the joint visual space, within-camera spatial variance to characterize how spread out the attribution is within each view, and cross-camera Gini coefficient to quantify how unevenly attribution is distributed across the six cameras. Experiments on BridgeAD, UniAD, and GenAD show that these statistics correlate with planning risk, achieving Spearman correlations of $0.30 \pm 0.07$ with trajectory error and AUROC of $0.77 \pm 0.04$ for collision detection. The signal generalizes to held-out scenes with negligible degradation and remains stable under an alternative attribution baseline.
☆ Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion
Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural advances, scaling, and retraining of monolithic generators. The inference-time alternative, i.e., refining the outputs of a pre-trained backbone with parameters left untouched, has remained largely unexplored for tabular synthesis. We introduce TARDIS (Tabular generation through Refinement, Distillation, and Inference-time Sampling), an inference-time refinement framework that operates on a frozen pre-trained backbone, configured per dataset by a Tree-structured Parzen Estimator search over score-level guidance during reverse diffusion, with each trial's objective set by an inner grid search over post-hoc sample selectors and an optional soft-label distillation step. The search space encodes a single mathematical pattern we name Bidirectional Chamfer Refinement (BCR): the symmetric Chamfer functional between synthetic and real samples is minimized both continuously, via a score-level gradient, and discretely, via batch-ranking post-generation. The per-dataset search recovers BCR-aligned configurations on most datasets, evidence for BCR as the dominant refinement pattern. Across 15 binary, multiclass, and regression benchmarks TARDIS achieves a median +8.6% downstream-task improvement over models trained on real data (95% CI [+3.3, +16.4], Wilcoxon p=0.016, 11/15 strict wins) and improves over the TabDiff backbone on all 15 datasets (mean +12.9%, p<10^-4), matching the backbone on manifold fidelity, diversity, and sample-level privacy. Inference-time refinement of a pre-trained tabular diffusion backbone reaches and exceeds real-data utility in 1 to 80 minutes on a single consumer-grade GPU.
☆ Beyond Rigid Alignment: Graph Federated Learning via Dual Manifold Calibration
Graph Federated Learning (GFL) enables collaborative representation learning across distributed subgraphs while preserving privacy. However, heterogeneity remains a critical challenge, as subgraphs across clients typically differ significantly in both semantics and structures. Existing methods address heterogeneity by enforcing the rigid alignment of model parameters or prototypes between clients and the server. However, these alignments implicitly rely on a restrictive global linearity assumption that summarizes local data distributions using a single and globally consistent representation space. This severely compresses the personalized representation space of clients and fails to preserve diverse local graph distributions. To overcome these limitations, we propose Federated Graph Manifold Calibration (FedGMC), a novel paradigm that tackles semantic heterogeneity and structural heterogeneity from a unified manifold perspective. Instead of enforcing rigid alignment, FedGMC introduces a dual manifold calibration mechanism that preserves global commonalities while maximizing the personalized representation space of local clients. Specifically, for semantic heterogeneity, the server constructs a geometrically optimal semantic manifold via equidistant semantic anchors, so as to guide the calibration of local semantic manifolds. For structural heterogeneity, the server constructs a global structural manifold by building global structural templates, so as to guide the calibration of local structural manifolds. Finally, the server dynamically refines both global semantic manifolds and structural manifolds by aggregating local manifolds. Extensive experiments on eleven homophilic and heterophilic graphs demonstrate that FedGMC effectively balances global commonality and local personalization, thereby significantly outperforming state-of-the-art baseline methods.
comment: 30 pages
☆ Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds
We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the $f$-DP framework. Our analysis covers the regime $σ\geq \sqrt{3/\ln M}$, where $σ$ is the noise multiplier and $M$ is the number of rounds within a single epoch. Unlike $f$-DP analyses for Poisson subsampling, which yield non-closed implicit formulas that can be machine computed but are non-transparent, random shuffling admits a tight analysis yielding transparent and interpretable closed-form bounds. Our concrete bounds, derived via the Berry-Esseen theorem, are tight up to constant factors within the proof framework. We demonstrate worked parameter settings for a single epoch ($E=1$) with a corresponding trade-off function $\geq 1-a-δ$, that is, only $δ$ below the ideal random guessing diagonal $1-a$: For $δ= 1/100$ and $σ= 1$, roughly $M \approx 1.14\times 10^6$ rounds and $N \approx 1.14\times 10^7$ training samples suffice to achieve meaningful differential privacy. This is in contrast to recent negative results for the regime $σ\leq 1/\sqrt{2 \ln M}$. Our concrete bounds can be composed over multiple epochs leading to $δ$ having a linear in $E$ dependency, which restricts $E=O(\sqrt{M})$. To go beyond Berry--Esseen, we introduce a new proof technique based on a generalization of the law of large numbers that yields an asymptotic random guessing diagonal-limit result: if $E=c_M^2M$ with $c_M\to 0$, then the $E$-fold composed trade-off function satisfies $f^{\otimes E}(a)\to 1-a$ uniformly in $a\in[0,1]$ with $δ$ having only an $O(\sqrt{E})$ dependency. We compare this asymptotic regime with the corresponding Poisson subsampling asymptotic, and highlight the characterization of explicit convergence rates as an open question.
☆ The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks
Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We introduce a simple identity, the Feature Learning Equation, which identifies the weight Gram matrix as the key object capturing feature dynamics. This enables us to interpret gradient descent as implicitly inducing a hypothetical evolution of features, whose covariance structure - termed the Virtual Covariance - characterizes how representations evolve during training. Building on this perspective, we introduce Target Linearity, a measure quantifying the linear alignment between features and targets. By analyzing the training and layer-wise dynamics, we show that deep networks learn to sequentially transform representations toward target-linear structure. This linearization perspective provides a unified interpretation of several empirical phenomena, including Neural Collapse and linear interpolation in generative models.
comment: 29 pages including appendix
☆ The Role of Node Features in Graph Pooling
Graph pooling is commonly applied in graph classification, yet its empirical gains over standard WL-1 expressive GNNs are often marginal or inconsistent. We study this gap by analysing the interaction between node features and graph topology and their effect on pooling objectives. Our analysis reveals that pooling operators require node features that are well-aligned with the graph's topology -- a condition often overlooked and not guaranteed in empirical networks. We formalise fundamental requirements for node features to enable effective pooling, and introduce a quantitative measure of feature quality. Our empirical evaluation shows that, when these requirements are satisfied, pooling can be beneficial and improve performance on appropriate datasets.
☆ Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations
In this paper, we propose Lagrangian Gaussian Processes (LGPs) for probabilistic and data-efficient learning of dynamics via discrete forced Euler-Lagrange equations. Importantly, the geometric structure of the Lagrange-d'Alembert principle, which governs the motion of dynamical systems, is preserved by construction in the absence of external forces. This allows learning physically consistent models that overcome erroneous drift in the system's energy, thereby providing stable long-term predictions. At the core of our approach lie linear operators for Gaussian process conditioning, constructed from discrete forced Euler-Lagrange equations and variational discretization schemes. Thereby and unlike prior work, the method enables learning dynamics from discrete position snapshots, i.e., without access to a system's velocities or momenta. This is particularly relevant for a large class of practical scenarios where only position measurements are available, for instance, in motion capture or visual servoing applications. We demonstrate the data-efficiency and generalization capabilities of the LGPs in various synthetic and real-world case studies, including a real-world soft robot with hysteresis. The experimental results underscore that the LGPs learn physically consistent dynamics with uncertainty quantification solely from sparse positional data and enable stable long-term predictions.
comment: 30 pages
☆ Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant
Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block $d$ decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies -- per-block, hardness-gated, and depth-scaled -- that recover current-layer separation measures without relying on backpropagated gradients. On CIFAR-10 and CIFAR-100, these remedies dramatically improve layer-separation statistics, with $4\times$--$45\times$ gains in deeper layers, while changing accuracy by less than one percentage point for non-degenerate training procedures. Tiny ImageNet provides a tougher cross-dataset check for our selected block-wise configuration and reveals the same qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments further show that architecture and augmentation choices have a larger effect on final accuracy than the training-rule modifications studied here. Cumulative free-riding is therefore a real and repairable optimization pathology. Nonetheless, for the FF training rules, architectures, and datasets we study, it is not the dominant factor limiting achievable accuracy.
☆ When Graph Language Models Go Beyond Memorization
It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond memorization at scale, primarily in the high-frequency regime. This is supported by the following empirical evidence: On five TU benchmarks, LLaMA-style graph language models reach high subgraph-rank correlation, yet their alignment is matched or exceeded by the memorization bootstrap in most cases. At small scale, under our bootstrap diagnostic, fidelity is largely indistinguishable from verbatim recall. In contrast, at large scale with 3.75M graphs, verbatim memorization drops sharply while rank correlation remains near ceiling. Crucially, in a separate fixed-subsample analysis, frequent subgraph mining restricted to the novel-only subset closely tracks the corresponding all-generation Spearman correlation, providing evidence that the alignment is not driven solely by verbatim recall. Across all scales, high-frequency patterns are well reproduced, while rare patterns remain poorly covered, and this deficit narrows only marginally as capacity increases. We observe the same scale-dependent crossover under two distinct graph serializations (canonical DFS code and action sequences), providing evidence of robustness in our analysis.
comment: Under review
☆ Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks
Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poisoning-based threats, leaving evasion-based threats underexplored. In this work, we first identify a cross-modal gradient mismatch under the multi-user promotion setting, where visual and textual perturbations are optimized in inconsistent directions due to the dominance of distinct user groups. This phenomenon dilutes the attack effectiveness and leads robust training to underestimate worst-case risks. To address this issue, we propose Untargeted Adversarial Training with Multimodal Coordination (UAT-MC). UAT-MC tackles the challenge of unknown targeted items in evasion-based attacks (as opposed to poisoning-based attacks) by treating all items as potential targets, and introduces a gradient alignment mechanism to explicitly correct this mismatch. This design ensures synchronized perturbations across modalities, thereby maximizing adversarial strength for robust training. Extensive experiments demonstrate that UAT-MC significantly improves robustness against promotion attacks while maintaining acceptable recommendation performance under the defense-accuracy trade-off. Code is available at https://github.com/gmXian/UAT-MC.
☆ Soft Deterministic Policy Gradient with Gaussian Smoothing
Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.
comment: 25 pages, 4 figures
☆ Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control--drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject$\times$mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.
☆ AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks
Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
☆ TIDE: Every Layer Knows the Token Beneath the Context
We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse Problem, where limited parameters models map distributionally similar tokens to indistinguishable hidden states. As an attempt to address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection as well as improve performance across multiple language modeling and downstream tasks.
☆ Playing the network backward: A Game Theoretic Attribution Framework
Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.
☆ Contrastive Identification and Generation in the Limit
In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target's support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton, which encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs $\{x,y\}$ satisfying $h(x)\ne h(y)$ for an unknown target binary hypothesis $h$, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of Angluin [1980]'s tell-tale condition), a combinatorial dimension called contrastive closure dimension (a contrasitive analogue of the closure dimension in Raman et al. [2025]) and exactly characterizing uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.
☆ Super-Level-Set Regression: Conditional Quantiles via Volume Minimization
Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires minimizing a volume objective that is coupled with the conditional quantiles of the model's own estimation error. In this work, we address this challenge. We introduce super-level-set regression (SLS), a novel mathematical framework that successfully resolves this implicit coupling, allowing us to directly parameterize and optimize the geometric boundaries of the target conditional level sets. By bypassing full distribution estimation and leveraging flexible volume-preserving frontier functions, our approach natively captures complex, multimodal, and disjoint conditional structures end-to-end. Ultimately, SLS offers a new perspective on multivariate conditional quantile regression, replacing the restrictive assumptions of density-first methods with a direct geometric optimization strategy.
☆ Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation
Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
☆ Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.
☆ When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination
Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. A componentwise bound on the transfer gap gives a population-level diagnostic. This separates a clean-side covariance cost from a retained-contamination cost, governed by the dirty-to-clean retention ratio. Trimming helps when the anomaly score separates retention probabilities while remaining score-neutral on the clean population. Otherwise, it cannot substantially reduce contamination through the retained mixture coefficient. We also give finite-sample certificate templates that provide numerical guarantees under independent audit.
☆ Bandit Learning in General Open Multi-agent Systems
Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures the impact of new agents on the system, and \emph{global dynamic regret} compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.
☆ Bridging visual saliency and large language models for explainable deep learning in medical imaging
The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
☆ Constrained Contextual Bandits with Adversarial Contexts
We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In this setting, the objective is to simultaneously control regret and the violation of the budget constraint. Building on the seminal $\mathsf{SquareCB}$ framework of Foster et al. [2018], we propose a simple and modular framework that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to prior works, which focus on stochastic contexts, our reduction yields improved guarantees for more general adversarial contexts, together with an efficient algorithm with a compact and transparent analysis.
☆ Predictive-Generative Drift Decomposition for Speech Enhancement and Separation NeurIPS 2026
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains up +1.0 NISQA for speech separation.
comment: Submitted to NeurIPS 2026
☆ In-Context Black-Box Optimization with Unreliable Feedback
Black-box optimization in science and engineering often comes with side information: experts, simulators, pretrained predictors, or heuristics can suggest which candidates look promising. This information can accelerate search, but it can also be biased, input-dependent, or misleading. Feedback-aware BO methods typically handle one task at a time, limiting their ability to generalize over multiple sources of feedback. In-context optimizers address cross-task adaptation, but usually assume that optimization history is the only available signal at test time. We study feedback-informed in-context black-box optimization (FICBO), where a pretrained optimizer conditions on both the observed history and cheap auxiliary feedback for the current candidate set. We introduce a structured feedback prior that models how feedback sources vary in their access, relevance, and distortion relative to the true objective, and use it to pretrain a feedback-aware transformer. At test time, the model estimates source reliability in context by comparing observed objective values with auxiliary signals, improving query selection. On synthetic and real-world tasks, FICBO effectively exploits informative feedback while remaining robust to weak or misleading sources, improving over other baselines. Empirical investigations further illustrate how the model perceives test-time sources, offering insights into its interpretability and decision-making process.
☆ Teaching LLMs Program Semantics via Symbolic Execution Traces
We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symbolic execution engine on generic open-source C code and using the resulting traces for continued pretraining of Qwen3-8B. Just ${\sim}$3,000 bug traces combined with chain-of-thought reasoning at inference time improve violation detection by over 17 percentage points, producing one of the most balanced accuracy profiles among evaluated models. On violation detection, the trained 8B model outperforms the 4$\times$ larger Qwen3-32B without thinking and approaches it in overall accuracy. The interaction between trace training and chain-of-thought is superadditive: neither alone provides meaningful gains, but their combination does. Improvements transfer across all five property types, including ones the training traces do not target. Our 28 configurations confirm the gains stem from trace semantics, not code volume, and that trace curation and format matter.
☆ Rethinking Adapter Placement: A Dominant Adaptation Module Perspective
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA's trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.
☆ Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective
Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
☆ Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers
Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained centered residual update with a leaky trunk-mean replacement. On a 400-layer single-stream DiT, MV-Split prevents the divergent collapse that crashes the un-stabilized baseline; it tracks close to the baseline's pre-crash trajectory while remaining substantially better than token-isotropic gating methods such as LayerScale across the full schedule. Finally, we present a 1000-layer DiT as a scale-validation run at boundary scales, establishing that the architecture remains stably trainable at extreme depth.
comment: 43 pages (9-page main paper + appendix)
☆ One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning
In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrogate scoring rule. Under first- and second-order validation-improvement approximations, parameter importance and data utility emerge as column-wise and row-wise aggregations of a single gradient interaction matrix, yielding a closed-form row-column correspondence for co-extracting both signals. Building on this structure, we propose DualSFT (Dual-Selection Fine-Tuning), a one-shot dual-scoring algorithm that produces a parameter mask and data subset from shared gradient statistics. On 3B-9B LLMs, single-axis DualSFT variants strengthen target-task performance and stability-plasticity trade-offs within their comparison groups, while full DualSFT yields a more favorable joint-constrained trade-off than sequential hybrid baselines under matched budgets.
☆ Entropy-Regularized Adjoint Matching for Offline RL
Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} that restricts off-manifold exploration. Existing workarounds, such as appending \textit{residual} Gaussian policies, often re-introduce the expressivity bottlenecks associated with unimodal distributions. In this work, we propose \textit{Maximum Entropy Adjoint Matching} (ME-AM), a unified framework that addresses these limitations within the continuous flow formulation. ME-AM incorporates two mechanisms: (1) a Mirror Descent entropy maximization objective that mitigates the popularity bias to facilitate the extraction of optimal policies from offline datasets, and (2) a \textit{Mixture Behavior Prior} that mathematically broadens the geometric support to encompass out-of-distribution high-reward regions. By exploring this extended geometry, ME-AM identifies robust actions while preserving the absolute continuity of the generative vector field. Empirically, ME-AM demonstrates competitive or superior performance compared to prior state-of-the-art (SOTA) methods across a diverse suite of sparse-reward continuous control environments.
☆ Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models
Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs share the discreteness, but not the geometry. Their entities and relations are discrete symbols, yet their arrangement is relational and lacks a common, fixed grid. Knowledge Graphs (KGs) share the discreteness, but not the geometry. They form irregular, non-Euclidean topologies whose local neighborhoods differ from graph to graph. Therefore, Knowledge Graph Foundation Models (KGFMs) rely on identifying structural invariances to produce transferable representations. Without a universal token set, KGFMs are limited in their ability to transfer representations across unseen KGs. We close this gap by treating graphlets, small connected graphs, as structural tokens that recur in heterogeneous KGs. In this paper, We introduce a model-agnostic framework based on a vocabulary of graphlets that mines a KG between relations via pattern matching. In particular, we considered closed and open 2- and 3-path, and star graphlets, to obtain robust invariances. The framework is evaluated on 51 KGs from a wide range of domains, for zero-shot inductive and transductive link prediction. Experiments show that adding simple graphlets to the vocabulary yields models that outperform prior KGFMs.
☆ Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
comment: 28 pages, 13 figures
☆ AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor--critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor--critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
comment: 22 pages, 9 figures
☆ Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow
Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent-variable learning into three consistency conditions: conditional-likelihood consistency, prior consistency, and posterior consistency. TVC shows that two-stage training preserves the reconstruction side but leaves prior consistency outside the tokenizer objective: the overall token distribution is fixed before the AR prior participates in training. Motivated by this view, we add a distribution-level prior-matching signal during tokenizer training, while keeping the reconstruction objective unchanged. We optimize this signal with a Wasserstein-gradient-flow update. For hard categorical tokens, the update reduces to a token-level contrast between an auxiliary AR model that tracks the tokenizer's current token distribution and the target AR prior. It requires only forward passes through the two AR models and does not backpropagate through either of them. The resulting tokenizer, wAR-Tok, reduces AR loss and improves generation FID on CIFAR-10 and ImageNet at comparable reconstruction quality.
☆ Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization
Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.
☆ Matrix-Valued Optimism is Matrix-Valued Augmentation: Additive Hybrid Designs for Constrained Optimization
Augmented Lagrangian and optimistic primal--dual methods stabilize equality-constrained optimization through seemingly different mechanisms: the former adds constraint-dependent primal curvature, while the latter adds dual memory. Recent work has shown that these mechanisms are equivalent for scalar parameters. We extend this equivalence to matrix-valued correction. We prove an additivity principle: for symmetric matrix parameters, the ideal primal trajectory depends only on the summed correction matrix, not on how it is split between augmented and optimistic channels. This exposes a design freedom: algebraically equivalent decompositions can have different finite-step feasibility because augmented correction affects primal curvature, whereas optimistic correction affects the scale of the dual memory correction. We formulate the resulting step-size-limited design problem and derive a closed-form hybrid rule that selects a matrix correction, splits it between the two channels, and chooses primal and dual steps using local spectral weights. Experiments on nonlinear equality-constrained problems with controlled constraint-Jacobian conditioning show that the hybrid design improves over pure augmented and pure optimistic endpoints, closely tracks a grid-search hybrid oracle, and is competitive with first-order primal--dual baselines under mild-to-moderate ill-conditioning. The experiments also identify the expected limitation: exact cancellation requires increasingly large matrix corrections as the constraint Jacobian becomes ill-conditioned.
☆ SymDrift: One-Shot Generative Modeling under Symmetries
Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariances effectively, even when trained on a non-invariant empirical distribution, but they typically rely on costly multi-step sampling. Recently, drifting models have emerged as an efficient alternative, enabling single-step generation and achieving state-of-the-art performance in generative modeling tasks. However, we show that drifting models face a symmetry-specific challenge, since an equivariant generator does not generally produce the same drifting field as the one obtained from the symmetrized target distribution. Addressing this issue would require expensive symmetrization of the empirical distribution. To avoid this cost, we propose SymDrift, a framework that makes the drifting field itself symmetry-aware. We introduce two complementary strategies: (i) a symmetrized drift in coordinate space based on optimal alignment, and (ii) a $G$-invariant embedding that removes symmetry ambiguity by construction. Empirically, SymDrift outperforms existing one-shot methods on standard benchmarks for conformer and transition state generation, while remaining competitive with significantly more expensive multi-step approaches. By enabling one-shot inference, SymDrift reduces computational overhead by up to 40$\times$ compared to existing baselines, making it promising for high-throughput applications such as virtual drug screening and large-scale reaction network exploration.
☆ Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
☆ Autoregressive Visual Generation Needs a Prologue
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
☆ Diffusion model for SU(N) gauge theories
Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can be extended to other Lie groups. We apply the method to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulations. We show that the diffusion models can be successfully trained and applied for sampling the Wilson gauge action. For large values of inverse coupling, accurate reverse-time integration requires predictor-corrector schemes, for which we introduce a corrector based on Hamiltonian molecular dynamics. While the corrector significantly improves sampling quality, it also increases the computational cost. We outline several strategies for improving sampling efficiency.
comment: 23 pages, 6 figures
☆ BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fine-tuning into a multi-round residual optimization process by training sequential PEFT adapters as weak learners. To incorporate tabular inductive bias, BoostLLM integrates decision-tree paths as a second input view alongside raw features; analysis reveals that the path view acts as a structured teacher in early training steps before the model shifts toward feature-driven representations. Empirically, BoostLLM achieves consistent improvements over standard fine-tuning across multiple LLM backbones and datasets, matching or surpassing XGBoost across a wide range of shot counts and outperforming GPT-4o-based methods with a 4B model. We further show that the framework scales: pairing with stronger tree models and extended boosting horizons yields additional gains under appropriate stabilization. These results suggest that boosting can serve as a general training principle for LLM fine-tuning, particularly in low-data regimes for structured data.
comment: 19 pages, 4 figures
☆ Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer
Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes future rewards, containing far less information than typical state or action vectors, yet it consumes the same computational budget per token. Worse, the self-attention cost of Transformers grows quadratically with sequence length, so including RTG as a separate token adds unnecessary overhead. We propose SlimDT, which removes RTG from the autoregressive sequence. Instead, we inject RTG information into the state representations before the sequential modeling step, allowing the Transformer to process only a compact (state, action) sequence. This reduces the sequence length by one-third, directly improving inference efficiency. On the D4RL benchmark, SlimDT surpasses standard DT across various tasks and achieves performance comparable to existing state-of-the-art methods. Decoupling a sparse conditioning signal from an information-rich sequence thus yields both computational gains and higher task performance.
☆ CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision
Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods already learn measurement weighting through the solver, but they still use position-only objectives. As a result, the mean estimate may improve while the reported covariance remains too small, too large, or wrong in shape. In this work, we propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. The Weighting Generation Network (WGN) predicts per-satellite reliability weights. The differentiable Gauss--Newton solver maps these weights to a position estimate and posterior covariance, and proper scoring rules supervise the East--North predictive distribution end-to-end. We study negative log-likelihood (NLL), Energy Score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in uncertainty credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes, and the mean horizontal error and 95th-percentile error improve on the deep-urban scene. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77\,m to 11.68\,m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05. The case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
comment: Submitted to NAVIGATION: Journal of the Institute of Navigation
☆ Time-Inhomogeneous Preconditioned Langevin Dynamics
Langevin sampling from distributions of the form $p(x) \propto \exp(-Ψ(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $Ψ$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of $Ψ$. However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose the TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients, and only locally Lipschitz drifts, which has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.
☆ Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval ICML 2026
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
comment: Accepted by ICML 2026. 16 pages, 6 figures, 3 tables
☆ PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs IEEE
Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. However, deploying PoT-quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU-only execution for PoT-quantized DNNs on PYNQ-Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc
comment: Accepted to IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI), 2026
☆ Fast Gauss-Newton for Multiclass Cross-Entropy
In multiclass softmax cross-entropy, the full generalized Gauss-Newton (GGN) curvature couples all output logits through the softmax covariance, making curvature-vector products harder to scale as the number of classes grows. We show that the standard multiclass GGN can be decomposed exactly into a true-vs-rest term and a positive semidefinite within-competitor covariance term. Fast Gauss-Newton (FGN) retains the first term and drops the second, yielding a positive semidefinite under-approximation of the multiclass GGN that is exact for binary classification. The derivation uses an exact true-vs-rest scalar-margin representation of softmax cross-entropy: the loss and gradient are unchanged, and the approximation enters only at the curvature level. Exploiting the FGN curvature structure, the damped update can be written as an equivalent whitened row-space system with one row per mini-batch example. We solve this system matrix-free by conjugate gradient using Jacobian-vector and vector-Jacobian products of the scalar margin map. Targeted mechanism experiments and an evaluation on a fixed-feature multiclass head support the predictions from the decomposition: FGN stays closest to the full softmax GGN when competitor mass is concentrated or damping is large, and deviates as the dropped within-competitor covariance grows.
comment: 29 pages, 3 figures, 1 table, 1 algorithm
☆ Understanding diffusion models requires rethinking (again) generalization
This position paper argues that understanding generalization in diffusion models requires fundamentally new theoretical frameworks that go beyond both classical statistical learning theory and the benign overfitting paradigm developed for supervised learning. In diffusion models, unlike in supervised learning, memorization of training data and generalization to novel samples are incompatible: a model that has fully memorized its training set generates copies rather than novel data. Several theoretical explanations for why practical diffusion models nevertheless generalize have been proposed, based on capacity limitations, implicit regularization from optimization, or architectural inductive biases, but their interactions remain unclear. We argue that the field should pivot from explaining why the diffusion models do not memorize to investigating what the model actually learns during pre-memorization phase. To highlight our stance, we conduct empirical study of diffusion models trained on CIFAR-10, and we distill the findings into concrete open questions that we believe are key to improve understanding of generalization in diffusion models.
☆ PRISM: Iterative Cross-Modal Posterior Refinement for Dynamic Text-Attributed Graphs
Dynamic text-attributed graphs (DyTAGs) provide a powerful framework for modeling evolving systems in which node semantics and time-dependent interactions are tightly coupled. Recently, multimodal learning has emerged as a promising yet underexplored direction for enhancing DyTAG representation learning. However, existing methods typically rely on rigid modality partitions and one-shot fusion strategies, which limit their ability to capture the intrinsic and evolving dependencies between node semantics and interaction behaviors. To address these limitations, we propose \textbf{PRISM}, an iterative cross-modal posterior refinement framework for DyTAG representation learning. PRISM organizes DyTAG information into semantic and behavioral modalities, providing a more intrinsic alternative to carrier-level modality partitions. Instead of fusing the two modalities in a single step, PRISM learns a refinement trajectory that progressively transforms semantic priors into behavior-conditioned posterior states through cross-modal interaction with behavioral evidence. Extensive experiments on DTGB benchmark datasets show that PRISM achieves strong performance on temporal link prediction and destination node retrieval tasks. Further ablation studies validate the effectiveness of semantic--behavioral modeling and iterative posterior refinement.
☆ Normalized Architectures are Natively 4-Bit
Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions-such as applying random Hadamard transforms and performing per-tensor scaling calculations-to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt-nvfp4
☆ Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark
Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Every episode exposes causal variables, SCM-predicted intervention effects, and per-factor credit traces, making causal credit assignment, leave-one-out cross-archetype transfer, and policy auditability first-class metrics. We adapt a panel of reference baselines: random, heuristic, masked PPO, a causal-world-model PPO variant, and an architecture-matched scalar control. We propose Causal Graph-Factored Advantage PPO (CGFA-PPO) as a reference causal agent that uses SCM parents of win probability as factor-aligned critic targets with an intervention-calibration loss. All comparisons use paired seeds, paired-bootstrap confidence intervals, and Holm-Bonferroni correction within pre-registered families. Masked PPO and CGFA-PPO reach competitive in-distribution win rates and exceed the random baseline; per-factor calibration trajectories and leave-one-out transfer gaps expose diagnostic structure that scalar win rate alone cannot. We release the benchmark, reference-baseline results, and full evaluation protocol openly. By coupling a strategically rich, partially observed domain with an explicit causal interface and statistical protocol, MTG-Causal-RL gives causal-RL, world-model, and LLM-agent research a shared testbed for questions current benchmarks cannot pose together: causal credit assignment under masked action spaces, structural transfer across archetypes, and SCM-grounded policy auditability.
comment: 21 pages, 8 figures, 9 tables, 1 algorithm
☆ Geometry-Aware Simplicial Message Passing
The Weisfeiler--Lehman (WL) test and its simplicial extension (SWL) characterize the combinatorial expressivity of message passing networks, but they are blind to geometry, i.e., meshes with identical connectivity but different embeddings are indistinguishable. We introduce the Geometric Simplicial Weisfeiler--Lehman (GSWL) test, which incorporates vertex coordinates into color refinement for geometric simplicial complexes. In addition, we show that (i) the expressivity of geometry-aware simplicial message passing schemes is bounded above by GSWL, and (ii) that there exist parameters such that the discriminating power of GSWL is matched by these schemes on any fixed finite family of geometric simplicial complexes. Combined with the Euler Characteristic Transform (ECT), a complete invariant for geometric simplicial complexes, this yields a geometric expressivity characterization together with an approximation framework. Experiments on synthetic and mesh datasets serve to validate our theory, showing a clear hierarchy from combinatorial to geometry-aware models.
☆ Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models
In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
comment: 4 figures, 2 tables, 4 supplementaries
☆ Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
☆ Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.
☆ Towards Generation-Efficient Uncertainty Estimation in Large Language Models
Uncertainty estimation is important for deploying LLMs in high-stakes applications such as healthcare and finance, where hallucinations can appear fluent and plausible while being factually incorrect, making it difficult for users to judge whether an output should be trusted. Existing methods require one or more full autoregressive generations to estimate uncertainty, which introduces substantial inference cost and often delays uncertainty assessment. In this paper, we investigate whether effective uncertainty estimation can be achieved with partial generation or even input-only information. Specifically, we first develop a unified framework that formulates uncertainty estimation as an early estimation problem over the autoregressive generation process of LLMs. This framework organises existing and proposed estimators by the information they observe, ranging from multi-generation to input-only prediction, and clarifies the performance-cost trade-off underlying different uncertainty estimation methods. Building on this view, we study two largely underexplored low-cost settings: estimating uncertainty with part of the generation, and predicting uncertainty from the input prompt. We propose Logit Magnitude, which uses top-M logit evidence to estimate uncertainty from an early-stopped generation prefix, and MetaUE, which distils generation-based uncertainty into a lightweight input-only estimator trained with uncertainty scores. Extensive experiments on general and domain-specific benchmarks show that Logit Magnitude achieves strong performance, and partial generations of LLMs are often sufficient for effective uncertainty estimation. MetaUE further provides a competitive input-only approximation in several settings. These findings suggest that effective uncertainty estimation requires less generation than commonly assumed, enabling unreliable responses to be identified earlier.
comment: 21 pages, 6 figures, and 8 tables. The abstract provided in the metadata differs slightly from the manuscript version due to character limits
☆ When Brain Networks Travel: Learning Beyond Site
Graph-based learning on functional magnetic resonance imaging (fMRI) has shown strong potential for brain network analysis. However, existing methods degrade under cross-site out-of-distribution (OOD) settings because site-conditioned confounders induce non-pathological shortcuts, while functional connectivity constructed by temporal averaging obscures transient neurodynamics, limiting generalization to unseen sites. In this paper, we propose Cross-site OOD Robust brain nEtwork (CORE), a unified framework for brain network learning across unseen sites. CORE first performs site-aware confounder decoupling to mitigate site-conditioned bias and extract a cross-site population scaffold of reproducible diagnostic connectivity edges. It then profiles transient pathway dynamics over this scaffold using lightweight temporal descriptors and organizes scaffold edges into a line graph for transferable pathway-level modeling. Finally, CORE introduces a prior-guided subject-adaptive gating mechanism that leverages scaffold-derived population priors while preserving subject-specific connectivity variability. Extensive experiments under leave-one-site-out evaluation on real-world datasets (ABIDE, REST-meta-MDD, SRPBS, and ABCD) show that CORE consistently outperforms state-of-the-art baselines, with up to 6.7% relative gain. Furthermore, CORE remains robust to atlas variations, maintaining performance gains across different brain parcellation schemes.
☆ TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a specific dataset or task typically requires either full fine-tuning, which is computationally expensive, or parameter-efficient tuning methods (PEFT) such as LoRA, which must be tailored to the internal architecture of each TFM. Furthermore, the evidence on whether weight-space fine-tuning improves accuracy or calibration is mixed \citep{tanna_exploring_2026,rubachev_finetuning_2025}. We introduce TFM-Retouche, a lightweight input-space residual adapter that is architecture-agnostic by design with respect to the frozen TFM backbone. TFM-Retouche learns a small residual correction in the input space to align the input data with the inductive biases of the pretrained model. The adapter is trained end-to-end through the frozen TFM, with a post-training identity guard that falls back to the unmodified TFM whenever adaptation does not help on held-out validation. On TabArena-Lite (51 datasets spanning binary classification, multiclass classification, and regression), TabICLv2-Retouche -- the framework instantiated on TabICLv2 -- is the top-ranked method on the leaderboard with light per-task tuning and ensembling, lifting aggregate Elo by +56 over the frozen TabICLv2 base and sitting on the Pareto front of predictive quality versus both training and inference time.
☆ Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2--10$\times$ higher end-to-end throughput as compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.
comment: 22 pages, 36 figures
☆ Optimal Transport for LLM Reward Modeling from Noisy Preference
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
☆ Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large-scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture-conditional: channel-mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel-independent models (DLinear, PatchTST) are consistently degraded. In selected low-resource settings the gains are striking: TimesNet trained on only 10\% of Weather data with synthetic augmentation surpasses the full-data baseline (4 of 16 sparsity-dataset combinations). Averaged across all architectures, augmentation hurts in 67\% of trials. We further find that only the Seasonal-Trend generator reliably helps across the tested benchmarks, and that hard curriculum switching is actively harmful (+24\% MSE degradation). These results provide concrete, actionable guidelines on how to use synthetic data: use synthetic augmentation with channel-mixing architectures, use gradual annealing schedules, and treat low-resource augmentation as architecture- and dataset-dependent. Code is available at \href{https://github.com/hugoiscracked/synthetic-ts/tree/main}
☆ Multi-agent decision making: A Blackwell's informativeness approach
The rapid development of large language models (LLMs) has motivated research on decision-making in multi-agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting and debate, are largely ad-hoc and lack formal guarantees regarding the informativeness of the resulting decisions. In this paper, we provide a principled approach to analyse decisions made in the multi-LLM setting using Blackwell's informativeness framework. Within the Blackwell information-structure abstraction, we show that voting and debate induce information structures that are no more informative than the pooled private information of all agents. This result identifies Bayesian pooled posterior maximisation as an information-theoretic upper-bound decision rule under the Blackwell ordering. Motivated by this theoretical analysis, we introduce a practical method for LLM-based question-answering (QA) tasks that estimates each agent's posterior and approximates the pooled posterior using a product-of-posteriors estimator. Extensive experiments on six QA benchmarks demonstrate that our approach outperforms state-of-the-art multi-LLM debate and voting methods.
☆ Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Sequence-level evaluations in autoregressive Large Language Models (LLMs) rely on highly dependent token generation. Establishing tight concentration bounds for these processes remains a challenge due to two fundamental bottlenecks in existing frameworks: (i) classical inequalities typically separate dependency structures from target sensitivities, leading to a scalar collapse that inflates the variance proxy to a suboptimal $\mathcal{O}(N)$ for sparse terminal rewards; (ii) conversely, while certain spatial methods achieve tighter bounds, they lack the strictly causal filtration required by sequential generation, rendering them inapplicable to the autoregressive setting. To resolve both bottlenecks, we establish a sharp McDiarmid-type inequality for dependent sequences, governed strictly by the exact matrix-vector multiplication of the causal dependency resolvent and the target sensitivity vector. This Matrix-Decoupled Concentration (MDC) framework natively recovers optimal constants for Markov chains and exploits directed $d$-separation to yield order-optimal bounds for causal trees. Crucially, by exactly preserving the coordinate-wise sparsity of rewards within a strictly causal framework, MDC mathematically prevents scalar collapse, guaranteeing a dimension-free $\mathcal{O}(1)$ variance proxy and providing a rigorous mathematical justification for the stability of long-context reasoning.
☆ Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven
Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time ${O}(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.
☆ A Fine-Grained Understanding of Uniform Convergence for Halfspaces
We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $Θ(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{τ\ln(1/τ)}$ at true error $τ$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.
☆ Gaussian mixture models in Hilbert spaces via kernel methods
Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.
comment: 38 pages, 13 figures
☆ DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ controls the trade-off between theoretical storage and approximation accuracy. For matrix-vector products, DiBA decomposes dense multiplication into three element-wise scaling operations and two binary mixing operations, reducing the floating-point multiplication count from $mn$ to $m+k+n$. For optimization, we introduce DiBA-Greedy, an alternating solver that combines closed-form least-squares updates for the diagonal factors with exact one-bit improvement tests for the binary factors. We also introduce DiBARD (DiBA with Retuning only Diagonal factors), which replaces dense-matrix layers by DiBA factors, freezes the binary matrices, and retunes only the diagonal entries on downstream data. This preserves compact binary mixing without discrete search during adaptation. On 40 dense weight matrices extracted from public pretrained models, DiBA-Greedy yields consistent SNR improvements as the theoretical storage ratio increases. After DiBA replacement in two component-replacement studies, DiBARD improves DistilBERT/WikiText masked-token accuracy from 0.4447 to 0.5210 and Speech Commands test accuracy for an Audio Spectrogram Transformer from 0.7684 to 0.9781 without reoptimizing the binary factors.
☆ TabCF: Distributional Control Function Estimation with Tabular Foundation Models
Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.
☆ Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions ICML 2026
Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
comment: 63 pages, 50 figures; accepted by ICML 2026
☆ Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems
Reconstructing high-fidelity flow fields from low-fidelity observations is a central problem in scientific machine learning, yet recent diffusion and flow-matching models typically rely on iterative sampling, making them costly for latency-sensitive workflows such as ensemble forecasting, real-time visualization, and simulation-in-the-loop inference. We study whether a high-fidelity flow-matching generative model can be compressed into a compact one-step model for fast scientific flow reconstruction. Our approach distills an optimal-transport flow-matching teacher into a one-step consistency model. Low-fidelity observations are incorporated at inference by initializing the generative trajectory from a noised observation along the transport path, allowing an unconditional high-fidelity flow model to perform conditional reconstruction without retraining the teacher. We evaluate this distillation strategy on three fluid benchmarks, Smoke Buoyancy, Turbulent Channel Flow, and Kolmogorov Flow, using coarse-to-fine reconstruction as a controlled testbed at field sizes up to $256 \times 256$. Across these settings, the distilled student retains similar performance of the teacher's model on spectrum metrics, while using roughly half as many parameters and achieving a $12\times$ inference speedup over the flow-matching teacher. Under the same training budget, the distilled student also outperforms a one-step consistency model trained directly from scratch by $23.1\%$ in SSIM, showing that teacher distillation improves training efficiency rather than merely accelerating sampling. These results suggest a promising route for turning future high-capacity scientific generative models into compact reconstruction models that are faster to train, cheaper to run, and easier to deploy.
☆ Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
☆ Training Transformers for KV Cache Compressibility
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
comment: 32 pages, 4 figures
☆ Sharper Guarantees for Misspecified Kernelized Bandit Optimization
Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dimension, while in online regret bounds, the corresponding penalty is $\sqrt{γ_n}\,n\varepsilon$, where $γ_n$ is the maximum information gain after $n$ rounds of interaction. In this work, we show that, for a large class of kernels, the misspecification amplification can be reduced to logarithmic or polylogarithmic growth. In the offline setting, we first prove high-probability simple-regret bounds whose misspecification term is governed by a spectral Lebesgue constant. This yields logarithmic amplification for one-dimensional monotone spectra and polylogarithmic amplification for multivariate Fourier-diagonal product kernels. In the online setting, we modify a domain-splitting algorithm and prove a cumulative regret bound of $\widetilde{\mathcal O}(\sqrt{γ_n n}+n\varepsilon)$ under mild localized eigendecay assumptions, removing the extra $\sqrt{γ_n}$ factor from the misspecification term. The common principle is localization: spectral localization controls the Lebesgue constant of the offline approximation operator, while domain splitting implements the spatial analogue of this mechanism in the online setting, preventing local misspecification errors from being amplified globally.
☆ Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a ``uniform credit assignment'' assumption that indiscriminately broadcast trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we initially introduce P-trace as a sample-efficient, critic-free eligibility traces method, upon which we build S-trace, implementing a sparse eligibility traces mechanism to further mitigate variance and achieve fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we contextualize the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace not only outperforms GRPO, showing gains of 0.49\% on Qwen3-1.7B and 3.16\% on Qwen3-4B, and maintaining a robust 2.98\% improvement when scaled further to Qwen3-8B in average pass@16, but notably achieves this with simultaneously higher sample and token efficiency.
☆ Uncertainty Estimation via Hyperspherical Confidence Mapping ICLR 2026
Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for sampling-free and distribution-free uncertainty estimation. HCM decomposes outputs into a magnitude and a normalized direction vector constrained to lie on the unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of this geometric constraint. This yields deterministic and interpretable estimates applicable to both regression and classification. Experiments across diverse benchmarks and real-world industrial tasks demonstrate that HCM matches or surpasses ensemble and evidential approaches, with far lower inference cost and stronger confidence-error alignment. Our results highlight the power of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.
comment: Accepted at ICLR 2026. 24 pages, 7 figures, including appendix
☆ From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning
Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with global prototypes. These approaches are essentially coordinate alignment, where representations of clients are forced to match the global prototypes in the embedding space in an element-wise manner. Such alignment implicitly assumes that all clients should map their representations into the feature subspace defined by the global prototypes. This assumption is reasonable in homogeneous FL, where all clients share the same feature extractor. However, it becomes problematic in HtFL, since heterogeneous feature extractors naturally induce client-specific feature subspaces, and forcing all clients to optimize within a single global subspace unnecessarily suppresses their learning capacity. We observe that coordinate alignment implicitly couples two distinct objectives: aligning inter-class semantic structure, which is directly beneficial for classification, and enforcing a shared feature basis, which is unnecessary and even harmful under model heterogeneity. Building on this insight, we design FedSAF, which shifts the alignment objective from absolute coordinates to inter-class relational structure. We demonstrate that structural alignment consistently outperforms coordinate alignment in heterogeneous settings. Experiments on multiple benchmarks show that our structural alignment outperforms state-of-the-art prototype-based HtFL methods by up to 3.52\%.
comment: 14 pages, 10 figures, 9 tables
☆ COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach
The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Neighbor, Logistic Regression, Support Vector Machine and Random Forest. Random Forest yielded the best results, followed closely by Support Vector Machine, across all setups. In general, both the textual and linguistic features were found to improve fake news detection when used separately, however, combining them into a single model did not improve the detection significantly. Differences were also noted between the use of bigrams and part of speech tags. The study shows that textual and linguistic features can be used successfully in detecting fake news using the traditional machine learning approach as opposed to deep learning.
♻ ☆ Predictive and Prescriptive AI toward Optimizing Wildfire Suppression
Intense wildfire seasons require critical prioritization decisions to allocate scarce suppression resources over a dispersed geographical area. This paper develops a predictive and prescriptive approach to jointly optimize crew assignments and wildfire suppression. The problem features a discrete resource-allocation structure with endogenous wildfire demand and non-linear wildfire dynamics. We formulate an integer optimization model with crew assignments on a time-space-rest network, wildfire dynamics on a time-state network, and linking constraints between them. We develop a two-sided branch-and-price-and-cut algorithm based on: (i) a two-sided column generation scheme that generates fire suppression plans and crew routes iteratively; (ii) a new family of cuts exploiting the knapsack structure of the linking constraints; and (iii) novel branching rules to accommodate non-linear wildfire dynamics. We also propose a data-driven double machine learning approach to estimate wildfire spread as a function of covariate information and suppression efforts, mitigating observed confounding between historical crew assignments and wildfire growth. Extensive computational experiments show that the optimization algorithm scales to otherwise intractable real-world instances; and that the methodology can enhance suppression effectiveness in practice, resulting in significant reductions in area burned over a wildfire season and guiding resource sharing across wildfire jurisdictions.
♻ ☆ On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory holds flexibly for any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
♻ ☆ Flexible Agent Alignment with Goal Inference from Open-Ended Dialog
We introduce Open-Universe Assistance Games (OU-AGs), a formal framework extending assistance games to LLM-based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, and evolving. Current LLM agents struggle in multi-turn interactions and with maintaining accurate models of user intent in collaborative settings. Existing assistance game formulations assume fixed, predefined preferences, an assumption that breaks down in open-ended dialogue where goals are revised incrementally and expressed in natural language. Grounded in cognitive science accounts of preference construction, we represent human preferences as a dynamically updated distribution over discrete natural-language goals. To operationalize OU-AGs, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient online method that extracts and ranks candidate goals during interaction, using LLM-simulated users to perform probabilistic inference over goal hypotheses. This allows for interpretable, uncertainty-aware preference representations without large offline datasets. We evaluate GOOD across three text-based domains: grocery shopping, household robotics (AI2-THOR), and coding. Compared to baselines without explicit goal tracking, GOOD produces semantically coherent goal representations and improves alignment with user intent across domains.
comment: Previous version of the paper was titled: Open-Universe Assistance Games
♻ ☆ How to make the most of your masked language model for protein engineering ICLR 2026
A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
comment: Accepted into the GEM Workshop, ICLR 2026
♻ ☆ DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
comment: Project website: www.numansaeed.com/mobilefetalclip
♻ ☆ Purely Agent-Driven Black-Box Optimization for Biological Design
Many key challenges in biological design -- such as small-molecule drug discovery, antimicrobial peptide development, and protein engineering -- can be framed as black-box optimization over vast, complex structured spaces. Existing methods rely mainly on raw structural data and struggle to exploit the rich scientific literature. While large language models (LLMs) have been added to these pipelines, they have been confined to narrow roles within structure-centered optimizers. We instead cast biological black-box optimization as an agent-driven, language-based reasoning process. We introduce Purely Agent-driven BLack-box Optimization (PABLO), a hierarchical agentic system that uses scientific LLMs pretrained on chemistry and biology literature to generate and iteratively refine biological candidates. On both the standard GuacaMol molecular design and antimicrobial peptide optimization tasks, PABLO achieves state-of-the-art performance, substantially improving sample efficiency and final objective values over established baselines. Compared to prior optimization methods that incorporate LLMs, PABLO achieves competitive token usage per run despite relying on LLMs throughout the optimization loop. Beyond raw performance, the agentic formulation offers key advantages for realistic design: it naturally incorporates semantic task descriptions, retrieval-augmented domain knowledge, and complex constraints. In follow-up in vitro validation, PABLO-optimized peptides showed strong activity against drug-resistant pathogens, underscoring the practical potential of PABLO for therapeutic discovery.
♻ ☆ How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q{=}0$, the \textit{exploitation pole}) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the \textit{density-estimation pole}), under which the standard pipeline corresponds to a stepwise $q{=}1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_θ^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $Ω(\frac{1}{p_0})$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $Θ\big(\log(\frac{1}{p_0})\big)$ but memorizes label noise. This separation explains how SFT ($q{=}1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q{=}0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\big(\frac{q}{M P_θ^q}\big)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q{=}0.75$ remains stable, reaching $47.9$ \texttt{m@16} on HotPotQA ($+13.9$ over GRPO).
♻ ☆ Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications IEEE
Semantic communications mark a paradigm shift from bit-accurate transmission toward meaning-centric communication, essential as wireless systems approach theoretical capacity limits. The emergence of generative AI has catalyzed generative semantic communications, where receivers reconstruct content from minimal semantic cues by leveraging learned priors. Among generative approaches, diffusion models stand out for their superior generation quality, stable training dynamics, and rigorous theoretical foundations. However, the field currently lacks systematic guidance connecting diffusion techniques to communication system design, forcing researchers to navigate disparate literatures. This article provides the first comprehensive tutorial on diffusion models for generative semantic communications. We present score-based diffusion foundations and systematically review three technical pillars: conditional diffusion for controllable generation, efficient diffusion for accelerated inference, and generalized diffusion for cross-domain adaptation. In addition, we introduce an inverse problem perspective that reformulates semantic decoding as posterior inference, bridging semantic communications with computational imaging. Through analysis of human-centric, machine-centric, and agent-centric scenarios, we illustrate how diffusion models enable extreme compression while maintaining semantic fidelity and robustness. By bridging generative AI innovations with communication system design, this article aims to establish diffusion models as foundational components of next-generation wireless networks and beyond.
comment: Accepted by IEEE COMST, GitHub repository: https://github.com/qin-jingyun/Awesome-DiffComm, project page: https://qin-jingyun.github.io/Awesome-DiffComm
♻ ☆ The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
♻ ☆ Suspicious Alignment of SGD: A Fine-Grained Step Size Condition Analysis
This paper explores the suspicious alignment phenomenon in stochastic gradient descent (SGD) under ill-conditioned optimization, where the Hessian spectrum splits into dominant and bulk subspaces. This phenomenon describes the behavior of gradient alignment in SGD updates. Specifically, during the initial phase of SGD updates, the alignment between the gradient and the dominant subspace tends to decrease. Subsequently, it enters a rising phase and eventually stabilizes in a high-alignment phase. The alignment is considered ``suspicious'' because, paradoxically, the projected gradient update along this highly-aligned dominant subspace proves ineffective at reducing the loss. The focus of this work is to give a fine-grained analysis in a high-dimensional quadratic setup about how step size selection produces this phenomenon. Our main contribution can be summarized as follows: We propose a step-size condition revealing that in low-alignment regimes, an adaptive critical step size $η_t^*$ separates alignment-decreasing ($η_t < η_t^*$) from alignment-increasing ($η_t > η_t^*$) regimes, whereas in high-alignment regimes, the alignment is self-correcting and decreases regardless of the step size. We further show that under sufficient ill-conditioning, a step size interval exists where projecting the SGD updates to the bulk space decreases the loss while projecting them to the dominant space increases the loss, which explains a recent empirical observation that projecting gradient updates to the dominant subspace is ineffective. Finally, based on this adaptive step-size theory, we prove that for a constant step size and large initialization, SGD exhibits this distinct two-phase behavior: an initial alignment-decreasing phase, followed by stabilization at high alignment.
comment: The 37th International Conference on Algorithmic Learning Theory
♻ ☆ Flow-Based Conformal Predictive Distributions
Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
comment: 9 pages, 15 figures, 20 appendix pages
♻ ☆ Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models ACL'2026
Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi-step reasoning tasks. While recent advancements on prompting and post-training have enabled LLMs to perform step-wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta-Reasoner, a new framework that empowers LLMs to "think about how to think". It optimizes the inference process by dynamically adapting reasoning strategies in real-time. Our approach employs contextual multi-armed bandits (CMABs) to learn an adaptive policy. It learns to evaluate the current state of LLM's reasoning and determine optimal strategy that is most likely to lead to a successful outcome during inference, like whether to backtrack, switch to a new approach, or restart the problem-solving process. This meta-guidance helps avoid unproductive paths exploration during inference and hence improves computational efficiency. We evaluate Meta-Reasoner on math problems (e.g., Game-of-24, TheoremQA) and scientific tasks (e.g., SciBench). Results show that our method outperform previous SOTA methods by 9-12% in accuracy, while reducing inference time by 28-35% under the same compute budget. Additional experiments on creative writing demonstrate the generalizability of our approach to diverse reasoning-intensive tasks.
comment: Accepted by ACL'2026
♻ ☆ Contrastive Image-Metadata Pre-Training for Materials Transmission Electron Microscopy
The transmission electron microscope facilitates the highest-resolution imaging of any instrument ever created, and its limiting factor is no longer spatial resolution but dose efficiency. Low electron doses avoid sample damage but produce noisy images for which, unlike in classical computer vision, there is no ground truth. Autonomous materials experimentation poses a related problem, since closed-loop instruments need representations grounded in the microscope state at acquisition. Both demand representations grounded in how an image was acquired. We release 7,330 paired high-angle annular dark-field scanning-TEM (HAADF-STEM) images and their seven-dimensional acquisition metadata, and propose Contrastive Image-Metadata Pre-training (CIMP), a CLIP-style encoder that aligns the two modalities and reaches 84.4% Top-1 cross-modal retrieval on a held-out split. All seven parameters are individually recoverable from the frozen visual embedding through a linear probe, and we use the embedding to condition a metadata-conditioned style-transfer model that re-renders experimental images under different acquisition parameters. Virtually scaling dwell time and beam current of low-dose images turns this model into a physics-informed denoiser; in a blind user study, experimental microscopists prefer it over the current state-of-the-art denoiser for STEM imagery on 70.2% of trials.
♻ ☆ Mirror Descent-Ascent for mean-field min-max problems
We study two variants of the mirror descent-ascent (MDA) algorithm for solving min-max problems on the space of measures: simultaneous and alternating. We work under assumptions of convexity-concavity and relative smoothness of the payoff function with respect to a suitable Bregman divergence, defined on the space of measures via flat derivatives. We establish non-asymptotic convergence rates to mixed Nash equilibria, measured in the Nikaidô-Isoda error, proving an $\mathcal{O}(N^{-1/2})$ rate for simultaneous MDA and an improved $\mathcal{O}(N^{-2/3})$ rate for alternating MDA. The main technical contribution is an infinite-dimensional dual space analysis that relates Bregman divergences on measures to dual Bregman divergences on spaces of bounded continuous functions, allowing us to control asymmetric commutator terms created by alternating updates. The results substantially generalize prior analyses restricted to bilinear objectives and also apply to nonlinear convex-concave problems on measure spaces, thereby providing a unified theoretical foundation for MDA in mean-field min-max optimization.
comment: 57 pages; substantially revised version with improved presentation, re-worked main theorems, and added numerical experiments
♻ ☆ Importance-Guided Basis Selection for Low-Rank Decomposition of Large Language Models
Low-rank decomposition is a compelling approach for compressing large language models, but its effectiveness hinges on selecting which singular-vector bases to retain for a target task. Existing methods such as Basel adapt singular-value coefficients on downstream data and prune bases with small re-learned magnitudes, a heuristic that can be misaligned with task performance because it ignores the local geometry of the loss landscape. We present Basis Selection with Importance (BSI), a principled low-rank compression framework that ranks and prunes bases by directly estimating the expected loss increase incurred when each basis is removed. BSI derives a derivative-based importance score from a second-order Taylor expansion of the task loss with respect to singular values, combining first-order sensitivity and second-order curvature to quantify pruning impact. To make this criterion practical for LLMs, we develop an efficient Hessian-diagonal estimator by adapting the Hutchinson randomized-probing method to loss curvature with symmetric parameter perturbations. We provide a comprehensive theoretical analysis, including loss-increase bounds under basis pruning, explicit propagation of Hessian-diagonal estimation error into these bounds, variance characterization tied to the Hessian spectrum, high-probability sample-complexity guarantees for achieving a target estimation accuracy, and guidance on perturbation intensity. Extensive experiments on mathematical reasoning benchmarks demonstrate that BSI consistently outperforms state-of-the-art low-rank decomposition baselines, with especially strong improvements under deep compression.
♻ ☆ Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8$\sim$27 times less training time than the strongest baseline.
comment: 23 pages, 7 figures
♻ ☆ Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges ICLR 2026
The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrains the bridge's shape preventing it from model varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB provides a broader class of real-world stochastic processes, capturing richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.
comment: 44 pages, 21 figures, ICLR 2026
♻ ☆ Counterfactual Maps: What They Are and How to Find Them
Counterfactual explanations are a central tool in interpretable machine learning, yet computing them exactly for complex models remains challenging. For tree ensembles, predictions are piecewise constant over a large collection of axis-aligned hyperrectangles, implying that an optimal counterfactual for a point corresponds to its projection onto the nearest rectangle with an alternative label under a chosen metric. Existing methods largely overlook this geometric structure, relying either on heuristics with no optimality guarantees or on mixed-integer programming formulations that do not scale to interactive use. In this work, we revisit counterfactual generation through the lens of nearest-region search and introduce counterfactual maps, a global representation of recourse for tree ensembles. Leveraging the fact that any tree ensemble can be compressed into an equivalent partition of labeled hyperrectangles, we cast counterfactual search as the problem of identifying the generalized Voronoi cell associated with the nearest rectangle of an alternative label. This leads to an exact, amortized algorithm based on volumetric k-dimensional (KD) trees, which performs branch-and-bound nearest-region queries with explicit optimality certificates and sublinear average query time after a one-time preprocessing phase. Our experimental analyses on several real datasets drawn from high-stakes application domains show that this approach delivers globally optimal counterfactual explanations with millisecond-level latency, achieving query times that are orders of magnitude faster than existing exact, cold-start optimization methods.
♻ ☆ DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
Diffusion Transformers rely on static patchify tokenization, assigning the same token budget to smooth backgrounds, detailed object regions, noisy early timesteps, and late-stage refinements. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which replaces fixed patchification with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence through a chunking mechanism learned end-to-end with diffusion training. DC-DiT allocates fewer tokens to predictable regions and noisy timesteps, and more tokens to detailed regions and later refinement stages, yielding meaningful spatial segmentations and timestep-adaptive compression schedules without supervision. Furthermore, the router provides an importance ordering over retained tokens, enabling elastic inference: a single checkpoint can be evaluated at flexible compute budgets with a smooth quality-compute tradeoff. Additionally, DC-DiT can be upcycled from pretrained DiT checkpoints and is also compatible with orthogonal dynamic computation approaches. On class-conditional ImageNet generation, DC-DiT reduces inference FLOPs by up to 36.8% and improves FID by up to 37.8% over DiT baselines, yielding a stronger quality--compute Pareto frontier across model scales, resolutions, and guidance settings. More broadly, these results suggest that adaptive tokenization is a general mechanism for making visual generation both more efficient and more flexible at inference time.
♻ ☆ Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
comment: Work in progress (9 pages manuscript, 3 pages references, 16 pages appendix)
♻ ☆ AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.
♻ ☆ Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Self-play has recently emerged as a promising paradigm for post-training Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., a question), which it then addresses itself by producing a task output (e.g., an answer). A reward model evaluates the output, and the rewards are used to train the LLM, typically via Reinforcement Learning (RL). A key benefit of self-play for post-training LLMs is its minimal supervision costs: self-play avoids the need for high-quality input-output pairs traditionally constructed by humans or expensive proprietary models. Existing work, however, explores self-play only for verifiable tasks, such as math and coding, for which objective ground truth is available and easily checkable. In this paper, we seek to extend self-play to more realistic open-ended tasks. We propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics along with each input-output pair. The rubric is used to evaluate outputs and train the model. Crucially, we ground the framework on a content-rich pretraining corpus to (1) enable an exploitable generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both the pretrained base model and instruction-tuned model on multiple tasks ranging from long-form healthcare QA to creative writing and instruction following.
♻ ☆ Leviathan: Decoupling Input and Output Representations in Language Models
Modern language models use a single matrix for input embedding and output projection. This couples two distinct objectives: token representation and discrimination over a vocabulary. This work introduces Leviathan, a Transformer architecture that replaces the input embedding matrix with learned embedding vectorization (LEV), a compact continuous mapping from token indices to embeddings. Leviathan's output head remains untied for a parameter increase of as low as 0.2%. Under controlled comparisons with identical Transformer backbones, Leviathan consistently improves language modeling performance over standard tied-embedding baselines across a 200M-1.2B parameter regime on The Pile with gains that grow during training. At 1.2B scale, Leviathan reduces validation perplexity by 9%, requires $2.1\times$ fewer training tokens to reach the tied baseline's final loss, and improves on all six downstream benchmarks evaluated, including a 30% reduction in LAMBADA perplexity. Frequency-stratified analysis reveals gains to be concentrated in rare tokens, where continuous parameterization reduces perplexity by 81%, falling to near zero for the most frequent.
♻ ☆ Principled Federated Random Forests for Heterogeneous Data
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
♻ ☆ CatNet: Controlling the False Discovery Rate in LSTM with SHAP Feature Importance and Gaussian Mirrors
We introduce CatNet, an algorithm that effectively controls False Discovery Rate (FDR) and selects significant features in LSTM. CatNet employs the derivative of SHAP values to quantify the feature importance, and constructs a vector-formed mirror statistic for FDR control with the Gaussian Mirror algorithm. To avoid instability due to nonlinear or temporal correlations among features, we also propose a new kernel-based independence measure. CatNet performs robustly on different model settings with both simulated and real-world data, which reduces overfitting and improves interpretability of the model. Our framework that introduces SHAP for feature importance in FDR control algorithms and improves Gaussian Mirror can be naturally extended to other time-series or sequential deep learning models.
comment: Withdrawn by the authors. The main theoretical result relies on an assumption that is not valid as stated. A substantially revised and corrected work will be posted separately
♻ ☆ LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
comment: Major revision
♻ ☆ Pulling Back the Curtain on Deep Networks
In linear models, visualizing a weight vector naturally reveals the model's preferred input direction, but extending this intuition to deep networks via gradients or gradient ascent often yields brittle or adversarial-looking features. We argue that deep networks are better understood as input-conditioned affine operators, whose natural adjoint action pulls a neuron's preferred direction back to input space. We further refine this representation by backward-only softening and iterative enhancement to reconstruct coherent local structures encoded by the target neuron. This provides a unifying perspective on previously disparate ideas such as SmoothGrad, B-cos-style alignment, and Feature Accentuation. The resulting Semantic Pullbacks (SP) generate perceptually aligned, class-conditional post-hoc explanations that emphasize semantically meaningful features, facilitate coherent counterfactual perturbations, and remain theoretically grounded. Across convolutional architectures (ResNet50, VGG) and transformer-based models (PVT), Semantic Pullbacks achieve the best overall trade-off across faithfulness, stability, and target-sensitivity benchmarks, while remaining general, computationally efficient, and readily integrable into existing deep learning pipelines.
comment: Preprint; 9 pages, 23-page appendix, 12 figures, 6 Tables; v6 changes: slight reframing of the presentation
♻ ☆ Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates
Short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud masking, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework for field-level NDVI prediction under sparse, irregular clear-sky acquisitions. The architecture separates the encoding of historical NDVI and meteorological observations from future exogenous covariates, fusing both representations for multi-step quantile prediction. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to capture delayed meteorological effects relevant to vegetation response. Experiments on European satellite data show that the proposed approach outperforms statistical, deep learning, and time-series baselines on both pointwise and probabilistic evaluation metrics. Ablation studies confirm that target history is the primary driver of performance, with meteorological covariates providing additional gains in the full multimodal setting. The code is available at https://github.com/arco-group/ndvi-forecasting.
♻ ☆ Fusion Complexity Inversion: Why Simpler Cross View Modules Outperform SSMs and Cross View Attention Transformers for Pasture Biomass Regression CVPR
Accurate estimation of pasture biomass from agricultural imagery is critical for sustainable livestock management, yet existing methods are limited by the small, imbalanced, and sparsely annotated datasets typical of real world monitoring. In this study, adaptation of vision foundation models to agricultural regression is systematically evaluated on the CSIRO Pasture Biomass benchmark, a 357 image dual view dataset with laboratory validated, component wise ground truth for five biomass targets, through 17 configurations spanning four backbones (EfficientNet-B3 to DINOv3-ViT-L), five cross view fusion mechanisms, and a 4x2 metadata factorial. A counterintuitive principle, termed "fusion complexity inversion", is uncovered: on scarce agricultural data, a two layer gated depthwise convolution (R^2 = 0.903) outperforms cross view attention transformers (0.833), bidirectional SSMs (0.819), and full Mamba (0.793, below the no fusion baseline). Backbone pretraining scale is found to monotonically dominate all architectural choices, with the DINOv2 -> DINOv3 upgrade alone yielding +5.0 R^2 points. Training only metadata (species, state, and NDVI) is shown to create a universal ceiling at R^2 ~ 0.829, collapsing an 8.4 point fusion spread to 0.1 points. Actionable guidelines for sparse agricultural benchmarks are established: backbone quality should be prioritized over fusion complexity, local modules preferred over global alternatives, and features unavailable at inference excluded.
comment: Accepted to CVPR: Vision for Agriculture Workshop 2026 (Withdrawn)
♻ ☆ Screening Is Enough
A core limitation of standard softmax attention is that it does not provide an independently interpretable measure of query--key relevance: attention scores are unbounded, while attention weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected, and some attention mass is assigned even when no key is genuinely relevant. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening computes bounded query--key similarities and applies an explicit threshold, discarding irrelevant keys and aggregating the remaining keys without global competition. Across experiments, Multiscreen achieves comparable validation loss with roughly 30\% fewer parameters than a Transformer baseline and remains stable at substantially larger learning rates. It maintains stable long-context perplexity beyond the training context and shows little degradation in retrieval performance as context length increases. Finally, Multiscreen achieves lower full-context forward-pass latency at long context lengths.
comment: 36 pages, 27 figures. Revised version with retuned Transformer baselines, additional experiments, ablations, and appendix analyses
♻ ☆ Neural Stochastic Differential Equations on Compact State Spaces: Theory, Methods, and Application to Suicide Risk Modeling ICML 2025
Ecological Momentary Assessment (EMA) studies enable the collection of high-frequency self-reports of suicidal thoughts and behaviors (STBs) via smartphones. Latent stochastic differential equations (SDEs) are a promising model class for EMA data, as it is irregularly sampled, noisy, and partially observed. But SDE-based models suffer from two key limitations. (a) These models often violate domain constraints, undermining scientific validity and clinical trust of the model. (b) Training is numerically unstable without ad hoc fixes (e.g. oversimplified dynamics) that are ill-suited for high-stakes applications. Here, we develop a novel class of expressive SDEs whose solutions are provably confined to a prescribed compact polyhedral state space, matching the domains of EMA data. In this work, (1) we show why chain-rule based constructions of SDEs on compact domains fail, theoretically and empirically; (2) we derive constraints on drift and diffusion for general and stationary SDEs so their solutions remain in the desired state space; and (3), we introduce a parameterization that maps arbitrary (neural or expert-given) dynamics into constraint-satisfying SDEs. On several real EMA datasets, including a large suicide-risk study, our parameterization improves forecasts and optimization dynamics over standard latent neural SDE baselines. These contributions pave the way for principled, trustworthy continuous-time models of suicide risk and other clinical time series and extend applications of SDE-based methods (e.g. diffusion models) to domains with hard state constraints.
comment: Accepted at the Symposium on Probabilistic Machine Learning (ProbML) 2026, and at the Methods and Opportunities at Small Scale (MOSS), ICML 2025, Vancouver, Canada
♻ ☆ P^2O: Joint Policy and Prompt Optimization
Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on ``hard samples'' where all rollouts fail. This lack of variance eliminates crucial learning signals. For these intractable samples, simply scaling up rollout budgets offers limited gains. We introduce Joint Policy and Prompt Optimization (P$^2$O) to mitigate this collapse by alternating continuous policy updates with discrete prompt evolution. P$^2$O leverages the GEPA algorithm to discover successful reasoning prompts for intractable instances. Via context distillation, the model internalizes these prompt-induced gains directly into its parameters, removing the need for inference-time prompting. Empirically, P$^2$O restores critical advantage signals, significantly outperforming standard GRPO and surpassing baselines with doubled rollout budgets, ultimately yielding strong out-of-distribution generalization and an up to $9.5\%$ performance improvement. Our findings expose the limits of standard exploration in sparse-reward environments, illuminating the potential of unifying evolutionary algorithms with reinforcement learning. This integration of discrete semantic search and continuous parameter updates establishes a self-reinforcing paradigm for autonomous LLM alignment.
♻ ☆ Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training. In this paper, we propose using Low-Rank Adaptation (LoRA) as a structural regularizer for critic learning. Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. We evaluate our method across different off-policy RL algorithms, including SAC and FastTD3 based on different network architectures. Empirically, LoRA efficiently reduces critic loss during training and improves overall policy performance, achieving the best or competitive results on most tasks. Extensive experiments demonstrate that our low-rank updates provide a simple and effective form of structural regularization for critic learning in off-policy RL.
♻ ☆ Neural Discovery of Strichartz Extremizers
Strichartz inequalities are a cornerstone of the modern theory of dispersive PDEs, but their extremizers are known explicitly only in a handful of sharp cases. The non-convexity of the underlying functional makes the problem hard, and to our knowledge no systematic numerical attack has been attempted. We propose a simple neural-network-based pipeline that searches for extremizers as critical points of the Strichartz ratio, and apply it in three settings. First, on the Schrödinger group we recover the Gaussian extremizers of Foschi and Hundertmark--Zharnitsky in dimensions $d=1,2$ to within $10^{-3}$ relative error, with no analytical prior. Second, on $59$ further admissible pairs in $d=1$ where the answer is conjectural, the method consistently finds Gaussians, supporting the conjecture that Gaussians are the universal extremizers in the admissible range. Third, on the critical Airy--Strichartz inequality at $γ=1/q$, where existence is open, the optimization does not converge to any $L^2$ profile: instead, the iterates organize themselves as mKdV breathers $B(0,\cdot;α,1,0,0)$ with growing internal frequency $α$, and the discovered ratio approaches the Frank--Sabin universal lower bound $\widetilde A_{q,r}$ from below with a power-law gap $\simα^{-0.9}$. We confirm the same picture with an independent Hermite-basis ansatz. We propose a precise conjecture: the supremum equals $\widetilde A_{q,r}$ and is approached, but not attained, along the breather family. The pipeline thus serves both as a validator on known cases and as a discovery tool when no extremizer exists.
comment: 38 pages, 26 figures; v.2: corrected typos
♻ ☆ Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall -- projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes -- and a probe trained on naturally memorised content does not classify fine-tuning-injected secrets, marking two representationally distinct regimes. We then introduce probe-geometry alignment (PGA), a surgical erasure that aligns activations along the probe's live readout direction at each depth. PGA drives the cross-sequence probe below random chance at all four scales tested (toy depth-4: 0.17; Pythia-70M: 0.07; Mistral-7B: 0.45; GPT-2 medium: 0.06 via MD-PGA k=2) and remains robust to six adversarial probe variants. Against a re-fitting attacker who trains a fresh probe on PGA-treated activations, we extend PGA adversarially, defeating the re-fit probe at every memorisation-relevant depth while preserving five zero-shot capability benchmarks within 2.8 percentage points per task (mean Δacc = +0.2pp). The cross-sequence signature is a real, causally separable, regime-specific property of pretrained representations -- removable below chance with a single rank-one intervention per depth at no measurable capability cost.
♻ ☆ Fast and Efficient Gossip Algorithms for Robust and Non-smooth Decentralized Learning
Decentralized learning on resource-constrained edge devices demands algorithms that are communication-efficient, robust to data corruption, and lightweight in memory. State-of-the-art gossip-based methods address communication efficiency, but achieving robustness remains challenging. Methods for robust estimation and optimization typically rely on non-smooth objectives (\textit{e.g.}, pinball loss, $\ell_1$ loss), yet standard gossip methods are primarily designed for smooth losses. Asynchronous decentralized ADMM-based methods have been proposed to handle such non-smooth objectives; however, existing approaches require memory that scales with node degree, making them impractical when memory is limited. We propose AsylADMM, a novel asynchronous gossip algorithm for decentralized non-smooth optimization requiring only two variables per node. We provide a new theoretical analysis for the synchronous variant and leverage it to prove convergence of AsylADMM in a simplified setting based on the squared loss. Empirically, AsylADMM converges faster than existing baselines on challenging non-smooth problems, including quantile and geometric median estimation, lasso regression, and robust regression. More broadly, our novel gossip framework opens a practical pathway toward robust and non-smooth decentralized learning.
♻ ☆ It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Our theoretical understanding of neural networks is lagging behind their empirical success. One of the important unexplained phenomena is why and how, during the process of training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. We here investigate the mechanism by which gradient descent achieves this through analyzing the learning dynamics at the level of individual neurons in single hidden layer ReLU networks. We identify three dynamical principles, namely mutual alignment, unlocking and racing, that together explain why we can often successfully reduce capacity after training through the merging of equivalent neurons or the pruning of low norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, or why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
♻ ☆ HDTree: Generative Modeling of Cellular Hierarchies for Robust Lineage Inference ICML26
In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding biological processes. Key to this is the robust modeling of hierarchical structures that govern cellular development. Traditional methods face limitations in computational cost, performance, and stability. VAE-based approaches have made strides but still require branch-specific network modules, limiting their scalability and stability, while often suffering from posterior collapse. To overcome these challenges, we introduce HDTree, a generative modeling framework designed for robust lineage inference. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and employs a quantized diffusion process to model continuous cell state transitions. By aligning the generative process with the Waddington landscape, this method not only improves stability and scalability but also enhances the biological plausibility of inferred lineages. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in lineage inference accuracy, reconstruction quality, and hierarchical consistency. These contributions enable accurate and efficient modeling of cellular differentiation paths, offering reliable insights for biological discovery.\footnote{Code is available at https://github.com/zangzelin/code\_HDTree\_icml.
comment: accepted by ICML26
♻ ☆ DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning
Federated Learning (FL) enables collaborative model training across distributed devices while preserving local data privacy, making it ideal for mobile and embedded systems. However, the decentralized nature of FL also opens vulnerabilities to model poisoning attacks, particularly backdoor attacks, where adversaries implant trigger patterns to manipulate model predictions. In this paper, we propose DeTrigger, a scalable and efficient backdoor-robust federated learning framework that leverages insights from adversarial attack methodologies. By employing gradient analysis with temperature scaling, DeTrigger detects and isolates backdoor triggers, allowing for precise model weight pruning of backdoor activations without sacrificing benign model knowledge. Extensive evaluations across four widely used datasets demonstrate that DeTrigger achieves up to 251x faster detection than traditional methods and mitigates backdoor attacks by up to 98.9%, with minimal impact on global model accuracy. Our findings establish DeTrigger as a robust and scalable solution to protect federated learning environments against sophisticated backdoor threats.
comment: 21 pages
♻ ☆ Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
♻ ☆ Risk Horizons: Structured Hypothesis Spaces for Longitudinal Clinical Prediction
Predicting future clinical events from longitudinal electronic health records (EHRs) requires selecting plausible outcomes from a large and structured event space under sparse observations. While clinical coding systems provide hierarchical organization of events, cross-modal and temporal relationships are not explicitly specified and must instead be inferred from data, making prediction difficult for weakly observed longitudinal transitions. We introduce Risk Horizons, a geometry-aware framework for constructing patient-specific candidate spaces for multi-modal next-visit prediction. Risk Horizons combines deterministic coding hierarchies with data-driven lagged cross-modal associations, embeds the resulting clinical graph in hyperbolic space, and retrieves candidate futures using directional risk cones. This reframes longitudinal prediction as ranking within a compact, clinically coherent hypothesis space rather than scoring an unconstrained vocabulary. Experiments on MIMIC-IV and eICU demonstrate competitive next-visit prediction performance, with consistently improved hierarchy consistency across diagnoses, procedures, and medications. Further analysis suggests that hyperbolic structured candidate retrieval is the primary driver of performance, while LLMs are effective as constrained inference-time rerankers operating over clinically grounded candidate sets.
♻ ☆ Aligned explanations in neural networks
As artificial intelligence increasingly drives critical decisions, the ability to genuinely explain how neural networks make predictions is essential for trust. Yet, most current explanation methods offer post-hoc rationalizations rather than guaranteeing a true reflection of the model's reasoning. We introduce the notion of explanatory alignment, a requirement that explanations directly construct predictions rather than rationalize them. To achieve this in complex data domains, we present Pointwise-interpretable Networks (PiNets), a pseudo-linear architecture that forms linear models instance-wise. Evaluated on image classification and segmentation tasks, PiNets demonstrate that their explanations are deeply faithful across four criteria: meaningfulness, alignment, robustness, and sufficiency (MARS). Our contributions pave the way for promising avenues: by reconciling the predictive power of deep learning with the interpretability of linear models, PiNets provide a principled foundation for trustworthy AI and data-driven scientific discovery.
♻ ☆ A Basin-Selection Perspective on Grokking via Singular Learning Theory
Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape. The key measure is the local learning coefficient (LLC) which quantifies the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging SLT, we develop a basin-selection perspective on grokking in quadratic networks: LLC ranks competing near-zero-loss basins by statistical preference, while the training-time transition between them is governed by optimisation dynamics. In this view, grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin that dominates the posterior. To support this, we derive analytic formulas for the LLC in shallow quadratic networks under both lazy and feature learning regimes. Empirically, we demonstrate that LLC trajectories estimated from training data track the onset of generalisation and provide an informative probe of the optimisation path.
♻ ☆ Multivariate Standardized Residuals for Conformal Prediction
While split conformal prediction guarantees marginal coverage, approaching the stronger property of conditional coverage is essential for reliable uncertainty quantification. Naive conformal scores, however, suffer from poor conditional coverage in heteroskedastic settings. In univariate regression, this is commonly addressed by normalizing non-conformity scores using an estimated local score variance. In this work, we propose a natural extension of this normalization to the multivariate setting, effectively whitening the residuals to decouple output correlations and standardize local variance. Furthermore, we derive a sufficient condition characterizing a broad class of distributions for which standardized residuals yield asymptotic conditional coverage. We demonstrate that using the Mahalanobis distance induced by a learned local covariance as a non-conformity score provides a closed-form, computationally efficient mechanism for capturing inter-output correlations and heteroskedasticity, avoiding the expensive sampling required by previous methods based on cumulative distribution functions. This structure unlocks several practical extensions, including the handling of missing output values, the refinement of conformal sets when partial information is revealed, and the construction of valid conformal sets for transformations of the output. Finally, we provide extensive empirical evidence on both synthetic and real-world datasets showing that our approach yields conformal sets that improve upon the conditional coverage of existing multivariate baselines.
♻ ☆ Why and When Deep is Better than Shallow: Implementation-Agnostic State-Transition Model of Deep Learning
Why and when does depth improve generalization? We study this question in an implementation-agnostic state-transition model, where a depth-$k$ predictor is a readout class $H$ composed with the word ball $B(k,F)$ generated by hidden state transitions. Generalization bounds separate implementation error, approximation error, and statistical complexity, and upper bound the depth-dependent variance term by a Dudley entropy integral over $B(k,F)$, with a conditional lower-bound diagnostic under readout separation. We identify geometric and semigroup mechanisms that keep this entropy contribution saturated or polynomial, and contrast them with separation mechanisms that recover the classical exponential-growth obstruction. Coupling these variance upper bounds with approximation rates gives typical depth trade-off patterns, clarifying that depth is statistically favorable when approximation improves rapidly while the transition semigroup remains geometrically tame.
♻ ☆ SuperWing: a comprehensive transonic wing dataset for data-driven aerodynamic design
Machine-learning surrogate models have shown promise in accelerating aerodynamic design, yet progress toward generalizable predictors for three-dimensional wings has been limited by the scarcity and restricted diversity of existing datasets. Here, we present SuperWing, a comprehensive open dataset of transonic swept-wing aerodynamics comprising 4,239 parameterized wing geometries and 28,856 Reynolds-averaged Navier-Stokes flow field solutions. The wing shapes in the dataset are generated using a simplified yet expressive geometry parameterization that incorporates spanwise variations in airfoil shape, twist, and dihedral, allowing for an enhanced diversity without relying on perturbations of a baseline wing. All shapes are simulated under a broad range of Mach numbers and angles of attack covering the typical flight envelope. To demonstrate the dataset's utility, we benchmark two state-of-the-art Transformers that accurately predict surface flow and achieve a 2.5 drag-count error on held-out samples. Models pretrained on SuperWing further exhibit strong zero-shot generalization to complex benchmark wings such as DLR-F6 and NASA CRM, underscoring the dataset's diversity and potential for practical usage.
♻ ☆ Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach
We present a novel machine learning framework for the optimal control of fluid restless multi-armed bandit problems (FRMABPs) with state equations that are either affine or quadratic in the state variables. By establishing fundamental properties of FRMABPs, we develop an efficient numerical algorithm that generates a comprehensive training set by solving multiple instances with diverse initial states. We further enhance this training set by applying a nonlinear transformation to the feature vectors, leveraging structural properties of FRMABPs. A time-dependent state feedback policy is then learned using Optimal Classification Trees with hyperplane splits (OCT-H). We test our approach on machine maintenance, epidemic control, and fisheries control problems, demonstrating that our method yields high-quality state feedback policies. Furthermore, once a policy is learned, it achieves a speed-up of up to 26 million times compared to the direct numerical algorithm.
♻ ☆ Graph Learning Is Suboptimal in Causal Bandits AISTATS 2026
We study regret minimization in causal bandits under causal sufficiency where the underlying causal structure is not known to the agent. Previous work has focused on identifying the reward's parents and then applying classic bandit methods to them, or jointly learning the parents while minimizing regret. We investigate whether such strategies are optimal. Somewhat counterintuitively, our results show that learning the parent set is suboptimal. We do so by proving that there exist instances where regret minimization and parent identification are fundamentally conflicting objectives. We further analyze both the known and unknown parent set size regimes, establish novel regret lower bounds that capture the combinatorial structure of the action space. Building on these insights, we propose nearly optimal algorithms that bypass graph and parent recovery, demonstrating that parent identification is indeed unnecessary for regret minimization. Experiments confirm that there exists a large performance gap between our method and existing baselines in various environments.
comment: 32 pages, accepted at AISTATS 2026
♻ ☆ Unsupervised Anomaly Detection in Wearable Foot Sensor Data: A Baseline Feasibility Study Towards Diabetic Foot Ulcer Prevention
Diabetic foot ulcers (DFUs) are a severe complication of diabetes associated with significant morbidity, amputation risk, and healthcare burden. Developing effective continuous monitoring frameworks requires first establishing reliable baseline models of normal foot biomechanics. This paper presents a feasibility study of an anomaly detection framework applied to time-series data from wearable foot sensors, specifically NTC thin-film thermocouples for temperature and FlexiForce A401 pressure sensors for plantar load monitoring. Data were collected from healthy adult subjects across 312 capture sessions on an instrumented pathway, generating 93,790 valid multi-sensor readings spanning September 2023 to June 2024. Two unsupervised algorithms, Isolation Forest and K-Nearest Neighbors using Local Outlier Factor (KNN/LOF), were applied to detect statistical deviations in foot temperature and pressure signals. Results show that Isolation Forest is more sensitive to subtle, distributed anomalies, while KNN/LOF identifies concentrated extreme deviations but flags a higher proportion of sessions not corroborated by Isolation Forest. Since no clinical ground truth is available, this difference is interpreted as lower specificity under the shared 5 percent contamination assumption rather than a confirmed false-positive rate. A mild positive correlation (0.41-0.48) between pressure and temperature features supports the case for combined multi-modal monitoring. These findings establish a validated baseline analytical pipeline and provide a methodological foundation for future clinical validation studies involving diabetic patients, where the relationship between detected anomalies and DFU-related pathophysiology can be directly assessed.
comment: 36 pages, 19 figures. Published in Biomedical Signal Processing and Control, Vol. 123, Part A, 110416, September 2026. https://doi.org/10.1016/j.bspc.2026.110416
♻ ☆ DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis
Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequencing data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, hindering the utility of ML models. To address these issues, we propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on purely data-driven heuristics, DOGMA provides a prior-guided graph construction pipeline that integrates statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA exhibits strong robustness in strict zero-shot cell-type evaluation and sample efficiency while using substantially lower GPU memory and inference time in downstream evaluation.
comment: 34 pages, 4 figures
♻ ☆ RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation
A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+, which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.
comment: First three authors contributed equally
♻ ☆ Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, AI Control uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces AI-Control Games, a formal decision-making model of the red-teaming exercise as a multi-objective, partially observable, stochastic game. We also introduce reductions from AI-Control Games to a special case of zero-sum partially observable stochastic games that allow us to leverage existing algorithms to find Pareto-optimal protocols. We apply our formalism to model, evaluate and synthesise protocols for deploying untrusted language models as programming assistants, focusing on Trusted Monitoring protocols, which use weaker language models and limited human assistance. To demonstrate the utility of our formalism, we show improvements over empirical studies in existing settings, evaluate protocols in new settings, and analyse how modelling assumptions affect the safety and usefulness of protocols. Finally, we leverage our formalism to precisely describe some of the implicit assumptions in prior control work.
♻ ☆ Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded. The code is available on github: https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT
♻ ☆ KaVa: Latent Reasoning via Compressed KV-Cache Distillation ICLR 2026
Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.
comment: ICLR 2026
♻ ☆ FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards
Live future prediction refers to the task of making predictions about real-world events before they unfold. This task is increasingly studied using large language model-based agent systems, and it is important for building agents that can continually learn from the real world. It can provide a large number of prediction questions grounded in diverse real-world events, while preventing answer leakage. To leverage the advantages of future prediction, we present FutureWorld, a live agentic reinforcement learning environment that closes the training loop between prediction, outcome realization, and parameter updates. Specifically, we modify and extend verl-tool, resulting in a new framework that we call verl-tool-future. Unlike standard RL training frameworks that rely on immediate rewards, verl-tool-future stores prediction-time rollouts, backfills rewards after real-world outcomes become available, and then replays the completed trajectories for policy update. Across three open-source agents, successive FutureWorld training rounds lead to consistent improvements in prediction accuracy, probabilistic scoring, and calibration, demonstrating that delayed real-world outcome feedback can serve as an effective RL signal for predictive agents.
comment: We will release the code in the near future
♻ ☆ Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
Hallucination in large language models (LLMs) during long-form generation remains difficult to address under existing reinforcement learning from human feedback (RLHF) frameworks, as their preference rewards often overlook the model's own knowledge boundaries. In this paper, we propose the $\textbf{K}$nowledge-$\textbf{L}$evel $\textbf{C}$onsistency Reinforcement Learning $\textbf{F}$ramework ($\textbf{KLCF}$), which re-examines this problem from a distribution alignment perspective. KLCF formalizes long-form factuality as a bidirectional distribution matching objective between the policy model's expressed knowledge distribution and the base model's parametric knowledge distribution: under the constraint that generation must not exceed the support set of the base knowledge, the objective maximizes coverage of high-probability facts, thereby jointly optimizing precision and recall. To achieve this, we design a Dual-Fact Alignment mechanism that approximates the recall term using a factual checklist constructed by sampling from the base model, and constrains hallucinations with a lightweight truthfulness reward model. Both components are jointly optimized and require no external retrieval throughout training. Experimental results demonstrate that KLCF consistently improves factuality metrics across multiple long-form benchmarks and model scales, effectively alleviating hallucination and over-conservatism while maintaining efficiency and scalability.
comment: 32 pages
♻ ☆ RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners ACL 2026
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families-Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B)-RSAT improves faithfulness 3.7$\times$ over SFT alone (0.224$\rightarrow$0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
comment: 8 pages, 8 tables, 9 figures, and a 3-page Appendix. Accepted at the SURGeLLM Workshop at ACL 2026 and will be included in the proceedings
♻ ☆ Covering-Space Normalizing Flows: Approximating Pushforwards on Lens Spaces
We construct pushforward distributions via the universal covering map rho: S^3 -> L(p;q) with the goal of approximating these distributions using flows on L(p;q). We highlight that our method deletes redundancies in the case of a symmetric S^3 distribution. Using our model, we approximate the pushforwards of von Mises-Fisher-induced target densities as well as that of a Z_12-symmetric Boltzmann distribution on S^3 constructed to model benzene.
comment: Errors in text
♻ ☆ The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization
Text-to-image diffusion models (DMs) are frequently abused to produce harmful or copyrighted content, violating public interests. Concept erasure (unlearning) is a promising paradigm to alleviate this issue. However, there exists a peculiar forgetting illusion phenomenon with unclear cause. Based on empirical analysis, we formally explain this cause: most unlearning partially disrupt the mapping between linguistic symbols and the underlying internal knowledge, leaving the knowledge intact as dormant memories. We further demonstrate that distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting unlearning strength. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a novel attack framework designed to assess the robustness of current unlearning methods. IVO optimizes initial latent variables to realign the noise distribution of unlearned models with that of their vanilla counterparts, which reconstructs the fractured mappings and consequently revives dormant memories. Extensive experiments covering 11 unlearning techniques and 3 concept scenarios show that IVO outperforms state-of-the-art baselines, exposing fundamental flaws in current unlearning mechanisms. Warning: This paper has unsafe images that may offend some readers.
comment: 25 pages, 12 figures, 12 tables
♻ ☆ Post-Selection Distributional Model Evaluation
Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, requiring the reliable estimate of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and we establish explicit conditions under which it is provably more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance--reliability trade-offs.
♻ ☆ AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models
Vision-language-action (VLA) models have recently emerged as a powerful paradigm for building generalist robots. However, traditional VLA models that generate actions through flow matching (FM) typically rely on rigid and uniform time schedules, i.e., synchronous FM (SFM). Without action context awareness and asynchronous self-correction, SFM becomes unstable in long-horizon tasks, where a single action error can cascade into failure. In this work, we propose asynchronous flow matching VLA (AsyncVLA), a novel framework that introduces temporal flexibility in asynchronous FM (AFM) and enables self-correction in action generation. AsyncVLA breaks from the vanilla SFM in VLA models by generating the action tokens in a non-uniform time schedule with action context awareness. Besides, our method introduces the confidence rater to extract confidence of the initially generated actions, enabling the model to selectively refine inaccurate action tokens before execution. Moreover, we propose a unified training procedure for SFM and AFM that endows a single model with both modes, improving KV-cache utilization. Extensive experiments on robotic manipulation benchmarks demonstrate that AsyncVLA is data-efficient and exhibits self-correction ability. AsyncVLA outperforms existing methods across both simulation and real-world evaluations. Our code is available at https://github.com/YuhuaJiang2002/AsyncVLA.
♻ ☆ Subcritical Signal Propagation at Initialization in Normalization-Free Transformers
We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
comment: Minor text edits; 10 pages of main text; 34 pages total; 5 figures in the main text, 25 figures total; preprint
♻ ☆ Variational Kolmogorov-Arnold Network
Kolmogorov-Arnold Networks (KANs) offer a theoretically grounded alternative to multi-layer perceptrons by representing multivariate functions as compositions of univariate basis functions. However, a critical limitation of KANs is the need to manually specify the number of basis functions per layer -- a hyperparameter that directly controls model capacity and substantially impacts performance, yet whose optimal value varies unpredictably across tasks. We present InfinityKAN, a variational inference framework that eliminates this design choice by learning the number of basis functions during training. Our approach models the basis count as a latent variable with a truncated exponential prior, introducing a differentiable weighting function that enables gradient-based optimization. We establish the Lipschitz continuity of the variational objective, ensuring stable training dynamics. Experiments across 18 datasets spanning synthetic, image, tabular, and graph domains demonstrate that InfinityKAN matches or exceeds the performance of KANs while requiring no manual selection of the number of bases for each layer.
comment: Preprint
♻ ☆ MapPFN: Learning Causal Perturbation Maps in Context
Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pre-trained on a synthetic biological prior with causal interventions, decoupling pre-training from limited wet-lab data. Unlike existing methods, MapPFN uses in-context learning to map a sequence of experiments to a post-perturbation distribution, enabling a single pre-trained model to adapt to new datasets and arbitrary gene sets at inference time. Zero-shot, MapPFN identifies differentially expressed genes on par with models trained on real single-cell data, and fine-tuning further improves predictions across biological contexts. Our code, model and data are available at https://marvinsxtr.github.io/MapPFN.
♻ ☆ On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines
The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are essential for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline that represents the absence of informative features, a notion commonly referred to as missingness. Standard baselines, such as all-zero inputs, are often semantically meaningless in medical contexts, where intensity values carry clinical significance. In this work, we revisit the notion of missingness for medical imaging, expose the limitations of standard baselines in this setting, and formalize a stricter missingness we term semantic missingness: a baseline must not merely lack signal, but must represent a clinically plausible state in which the disease-related features are absent. This formulation motivates a counterfactual-guided approach to baseline selection, in which a synthetically generated counterfactual (i.e. a clinically normal variant of the pathological input) serves as a principled and semantically meaningful reference. We derive theoretical guarantees showing that counterfactual baselines yield more faithful attributions than standard alternatives, and empirically validate this with two complementary counterfactual generative models, a VAE and a diffusion model, though the concept is model-agnostic and compatible with any suitable counterfactual method. Across three diverse medical datasets, counterfactual baselines produce more faithful and medically relevant attributions, outperforming standard baseline choices as well as related methods. Notably, we also compare against using the counterfactual directly as an explanation (an established paradigm in its own) and show that employing it as a baseline for Integrated Gradients yields superior results, thereby bridging two complementary explainability paradigms.
♻ ☆ CIGaRS I: Combined simulation-based inference from type Ia supernovae and host photometry
Using type Ia supernovae as cosmological probes requires empirical corrections that are correlated with their host environment. Here we present a unified Bayesian hierarchical model designed to infer, from purely photometric observations, the intrinsic dependence of the brightness of type Ia supernovae on progenitor properties (metallicity and age), the delay-time distribution that governs their rate as a function of age, and cosmology, as well as the redshifts of all hosts. The model incorporates physics-based prescriptions for star formation and chemical evolution from Prospector-beta, dust extinction of both galaxy and supernova light, and observational selection effects. We show with simulations that intrinsic dependences on metallicity and age have distinct observational signatures, with metallicity mimicking the well-known step of magnitudes of type Ia supernovae across a host stellar mass of $\sim 10^{10}M_{\odot}$. We then demonstrate neural simulation-based inference of all model parameters from mock observations of ~16,000 type Ia supernovae and their hosts up to redshift 0.9. Our joint physics-based approach delivers robust and precise photometric redshifts (~0.01 median scatter) and improves cosmological constraints by a factor of ~4 over analyses of the small fraction of objects with spectroscopic follow-up. This approach unlocks the full power of photometric data and paves the way for an end-to-end simulation-based analysis pipeline in the LSST era.
comment: published in Nature Astronomy
♻ ☆ Algorithmic Task Capture, Computational Complexity, and Inductive Bias of Infinite Transformers
We formally define algorithmic capture of combinatorial tasks as the ability of a transformer to extrapolate to arbitrary task sizes with controllable error and logarithmic sample adaptation, providing a sharp scaling criterion for distinguishing logic internalization from statistical interpolation. Empirically, across scaling ranges spanning up to 2.5 orders of magnitude, we observe evidence of capture and non-capture. By analyzing infinite-width transformers in both the lazy and rich regimes, we derive upper bounds on the inference-time computational complexity of the combinatorial tasks these networks can capture. We show that, despite their universal expressivity, transformers possess an inductive bias that disfavors higher-complexity algorithmic procedures within the efficient polynomial-time heuristic scheme class, consistent with successful capture on simpler combinatorial tasks such as induction heads, sort, and string matching.
♻ ☆ Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching
Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. We prove that, for any pre-trained flow matching velocity field, the trace of the posterior covariance over the clean data given the current state equals, in closed form, the divergence of the velocity field, up to a known time-dependent prefactor and an additive constant. We call this the \emph{divergence-uncertainty identity} for flow matching. The matrix-level form of the identity is similarly closed-form, depending solely on the velocity Jacobian. Because the identity is exact and post-hoc, it is computable on any pre-trained flow matching model, with no retraining and no architectural modification. For one-step generators such as MeanFlow, the same identity yields the exact end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly 10,000$\times$ less total compute than ensembling or Monte Carlo dropout.
comment: 9 Pages, 5 figures
♻ ☆ Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Multi-objective bandits have attracted increasing attention for their broad applicability, with \(d\)-dimensional reward vectors inducing Pareto regret. There has been a subtle debate over whether this added structure makes the problem fundamentally harder than single-objective bandits. We answer this by showing that, in terms of Pareto regret, it is surprisingly no harder: Pareto regret scales inversely with \(g^\dagger\), the largest objective-wise suboptimality gap, and thus matches the smallest objective-wise classical regret. We formalize this idea via a novel method with upper and lower confidence-bound estimators for every arm-objective pair. It uses top-two races to compare arms within each objective and an uncertainty-greedy rule to allocate exploration toward the largest objective-wise gap \(g^\dagger\), until the corresponding Pareto-optimal arm is committed to. We prove that it achieves Pareto regret of \(O(\nicefrac{\log T}{g^\dagger})\), where \(T\) is the horizon, with \emph{no dependence on \(d\)}. A matching lower bound of \(Ω(\nicefrac{\log T}{g^\dagger})\) implies optimality. We evaluate the method on synthetic and real-world datasets, confirming the theory and achieving order-of-magnitude reductions in Pareto regret over baselines. Real-world results further show that our method commits to a Pareto optimal arm, possibly at the cost of empirical fairness, suggesting a potential hardness absent in single-objective bandits.
comment: 21 pages
♻ ☆ MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
♻ ☆ Disentangled Generative Graph Representation Learning
Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.
♻ ☆ Is Flow Matching Just Trajectory Replay for Sequential Data?
Flow matching (FM) is increasingly used in scientific domains for time series generation and forecasting, where data often arise from underlying dynamical systems. However, it is not well-understood whether it learns transferable dynamical structure or simply performs an effective "trajectory replay". We study this question by deriving the velocity field targeted by the empirical FM objective on sequential data in the limit of perfect function approximation. For the Gaussian conditional paths commonly used in practice, we show that the implied sampler is an ODE whose dynamics constitutes a nonparametric, memory-augmented continuous-time dynamical system. The optimal field admits a closed-form expression as a similarity-weighted mixture of instantaneous velocities induced by observed transitions, making the dataset dependence explicit and interpretable. This characterization positions neural FM models as parametric surrogates of an ideal nonparametric solution and suggests practical approximation schemes for robust ODE-based generation. As a byproduct of our analysis, the resulting closed-form sampler, FreeFM, provides strong probabilistic forecasts on nonlinear dynamical system benchmarks directly from historical transitions, without training.
comment: 56 pages
♻ ☆ PlatoLTL: Learning to Generalize Across Symbols in LTL Instructions for Multi-Task RL
A central challenge in multi-task reinforcement learning (RL) is to train generalist policies capable of performing tasks not seen during training. To facilitate such generalization, linear temporal logic (LTL) has emerged as a powerful formalism for specifying structured, temporally extended tasks to RL agents. While existing approaches to LTL-guided multi-task RL demonstrate generalization across LTL specifications, they are unable to generalize to unseen vocabularies of propositions (or "symbols"), which describe high-level events in LTL. We present PlatoLTL, a novel approach that enables policies to zero-shot generalize not only compositionally across LTL structures, but also parametrically across propositions. We model propositions as parameterized instances of atomic predicates, allowing policies to learn shared structure across related propositions. We propose a novel architecture that embeds and composes parameterized propositions to represent LTL formulae, and demonstrate zero-shot generalization in a range of challenging environments.
comment: 14 pages, 4 figures (main paper). 22 pages, 11 figures (appendix)
♻ ☆ FIT to Forget: Robust Continual Unlearning for Large Language Models
While large language models (LLMs) exhibit remarkable capabilities, they increasingly face demands to unlearn memorized privacy-sensitive, copyrighted, or harmful content. Existing unlearning methods primarily focus on \emph{single-shot} scenarios, whereas real-world deletion requests arrive \emph{continually}. Naïvely applying these methods to sequential requests leads to severe utility degradation and catastrophic forgetting. To address this, we propose \fit, a robust continual unlearning framework to process high-volume sequential deletion streams while resisting both catastrophic forgetting and post-unlearning recovery. \fit stabilizes sequential updates through three synergistic mechanisms: redundancy \underline{F}iltering, \underline{I}mportance-aware adaptive algorithm selection, and \underline{T}argeted layer attribution. Furthermore, to facilitate rigorous evaluation, we introduce \textbf{PCH}, a unified benchmark encompassing \textbf{P}ersonal, \textbf{C}opyrighted, and \textbf{H}armful content, alongside two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), to systematically quantify forgetting-utility trade-offs. Extensive experiments across five LLMs (up to 14B parameters) demonstrate that \fit consistently achieves state-of-the-art unlearning efficacy and utility preservation. Notably, even after hundreds of sequential requests, \fit preserves strong downstream (\eg, GSM8K, MMLU) performance and exhibits superior resilience against relearning and quantization recovery attacks.
comment: 26 Pages
♻ ☆ A Theoretical Analysis of Test-Driven Code Generation
Code assistants are increasingly utilized in test-driven software development, yet the theoretical mechanisms behind their environment-interaction strategies remain underexplored. We provide a probabilistic framework for two dominant paradigms: code selection after generation using the execution environment, and code generation conditioned on environment feedback. First, we formalize several well-established selection heuristics as environment-aware estimators of code correctness. We theoretically prove that estimators based on fuzzy functional similarity add an inductive bias and strictly dominate estimators based on functional equivalence in terms of signal-to-noise ratio. Second, we frame backprompting as an in-context approximation of Thompson sampling. We derive a novel regret bound for reward functions with unobservable components, theoretically explaining why the effectiveness of backprompting is limited by the ambiguity of the informal task description (an irreducible regret). Using five state-of-the-art open weight models, we corroborate these findings across BigCodeBenchHard, LeetCodeDataset, and QiskitHumanEvalSim. Our formalization also suggests how to improve task descriptions effectively, leading to a new benchmark, QiskitHumanEvalSimX.
comment: preprint
♻ ☆ ReMAP: Neural Reparameterization for Scalable MAP Inference in Arbitrary-Order Markov Random Fields
Scalable high-quality MAP inference in arbitrary-order Markov Random Fields (MRFs) remains challenging. Approximate message-passing methods are often efficient but can degrade on dense or high-order instances, while exact solvers such as Toulbar2 become increasingly expensive at scale. We present ReMAP, an instance-wise neural reparameterization framework that directly optimizes a differentiable relaxation of the original MRF energy. Instead of relying on supervised labels or amortized training, ReMAP treats each MRF as an independent optimization problem: a Graph Neural Network produces node-wise label distributions, and gradient-based optimization searches for a low-energy discrete solution in an over-parameterized continuous space. The method supports pairwise and arbitrary-order factors, heterogeneous label cardinalities, and efficient GPU execution, without requiring labeled solutions. We show that the relaxed objective is consistent with the discrete MAP problem and analyze how neural over-parameterization can expose low-energy optimization paths unavailable in the original discrete space. Empirically, on synthetic pairwise and high-order MRFs, UAI 2022 inference benchmarks, and real-world Physical Cell Identity (PCI) problems, ReMAP consistently outperforms approximate baselines and often finds lower-energy solutions than Toulbar2 on hard large-scale instances within practical time budgets.
♻ ☆ Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width
Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = Ω(κ_d(\mathbf{G})^6 n\log n)$ for NTK convergence, where $κ_d(\mathbf{G})$ is the effective condition number of the rank-$\min(n,d)$ truncation of the input Gram matrix; for natural image datasets this threshold is physically infeasible ($m \gg 10^{24}$ for MNIST and $m \gg 10^{29}$ for CIFAR-10, 12--17 orders of magnitude beyond the largest known architectures). \emph{Influence malleability} is introduced to characterize this non-convergence: linearized attention exhibits 2--9$\times$ higher malleability than ReLU networks under adversarial data perturbation, with the gap depending on dataset condition number and task setting. A dual implication is established: the same data-dependent kernel is shown theoretically to reduce approximation error when targets align with the data geometry, while, empirically, creating vulnerability to adversarial manipulation of the training data. The structural argument extends to trainable QKV attention under standard initialization, with direct consequences for influence methods applied to deployed transformer architectures.
♻ ☆ GraphVec: Cross-Domain Graph Vectorization for Graph-Level Representation Learning
Learning universal graph representations across heterogeneous domains is difficult because graph datasets differ in topology, node-attribute semantics, feature dimensions, and even attribute availability. We propose GraphVec, a language-model-free graph vectorization model that maps diverse graphs into transferable fixed-dimensional embeddings for graph-level tasks. Instead of directly using incomparable raw node attributes, GraphVec constructs multi-scale global graphs over all nodes in each dataset and extracts spectral embeddings to obtain domain-agnostic relational features. To make these spectral features comparable across datasets, we introduce a density-maximization mean alignment algorithm over orthogonal transformations and prove its monotonic convergence. GraphVec further combines a GIN--Graph Transformer backbone with a multi-layer reference distribution module, which preserves node-level distributional information beyond standard pooling. We also provide a generalization error bound for the proposed model. Experiments on 13 datasets with more than 15 comparison methods demonstrate that GraphVec consistently outperforms strong graph pretraining baselines in cross-domain few-shot graph classification and graph clustering. Beyond graph-level tasks, GraphVec also yields strong node-level representations, achieving competitive performance on few-shot node classification against representative graph prompt learning methods.
♻ ☆ Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits
Many bandit systems are deployed with offline historical data, such as past logs from earlier policies. Using these data can reduce early online exploration when they remain informative for the online problem. When the offline and online environments differ, such data can be biased for the online problem. For linear (contextual) bandits, this bias is directional: offline data may be informative in some feature directions and misleading in others. However, prior work typically controls this gap through a known Euclidean bound on the model parameters, which we prove is too coarse: even with the offline parameter known, bias in a single unknown direction can force dimension-dependent regret. To address this challenge, we introduce a directional bias certificate $(M_{\mathrm{bias}},ρ)$ that measures the offline-to-online gap through an $M_{\mathrm{bias}}$-induced norm and assigns different bias budgets to different directions. Building on this certificate, we propose \emph{Ellipsoidal-MINUCB}, which augments the online learning with an offline-pooled branch that safely exploits historical data. When the certificate is known, we show that the algorithm matches the standard SupLinUCB rate in the worst case and improves when offline coverage aligns with low-bias directions. When the certificate is unknown, we estimate it adaptively from offline and accumulated online data and establish a corresponding regret guarantee. Numerical experiments support the theory and show gains in aligned regimes.
♻ ☆ Dynamic Expert-Guided Model Averaging for Causal Discovery
Would-be practitioners of causal discovery face a dizzying array of algorithms without a clear best choice. This abundance of competitive methods makes ensembling a natural strategy for practical applications. At the same time, real-world use cases frequently violate the assumptions on which common causal discovery algorithms are based, forcing reliance on expert knowledge. Inspired by recent work on dynamically requested expert knowledge and large language models (LLMs) as experts, we present a flexible model averaging method that integrates selective expert querying to ensemble a diverse set of causal discovery algorithms. Crucially, we distinguish between edge existence and orientation, enabling the method to leverage the complementary strengths of data-driven discovery and expert input. We further consider the realistic setting of limited access to an imperfect expert, using disagreement among algorithms to query the expert in cases of greater uncertainty. Experiments demonstrate that our method consistently outperforms strong baselines on both clean and noisy data. Code and data are available at https://anonymous.4open.science/r/expert-cd-ensemble-3282/.
♻ ☆ Making Reconstruction FID Predictive of Diffusion Generation FID
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each dataset element, we retrieve its nearest neighbor in latent space, interpolate between their latent representations, decode the interpolated latent, and compute the FID between the decoded samples and the original dataset. We provide an intuitive explanation for why iFID correlates well with gFID, and why reconstruction metrics can be negatively correlated with gFID, by connecting iFID to recent results on diffusion generalization and hallucination. Theoretically, we show that iFID evaluates decoded interpolations aligned with the ridge set around which diffusion samples concentrate, thereby measuring a quantity closely related to diffusion sample quality. Empirically, iFID is the first metric shown to strongly correlate with diffusion gFID across diverse VAEs, achieving Pearson and Spearman correlations of approximately $0.85$. The project page is available at https://tongdaxu.github.io/pages/ifid.html.
♻ ☆ Congestion-Aware Dynamic Axonal Delay for Spiking Neural Networks
Spiking Neural Networks (SNNs) are widely regarded as an energy-efficient paradigm for modeling and processing temporal and event-driven information. Incorporating delays in SNNs has been proven to be an effective mechanism for improving spike alignment in event-driven tasks. However, existing delay learning approaches predominantly assign static delays to individual synapses, resulting in a large number of delay parameters and limited adaptability to input-dependent activity dynamics. To this end, we propose a Congestion-Aware Dynamic Axonal Delay (CADAD) mechanism, which decomposes the delay into a channel-wise static base delay for temporal structuring and a global, activity-conditioned shift that dynamically regulates the state update rate under varying spike intensities. The delay parameters are learned using differentiable linear interpolation and discretized at inference time, preserving the benefits of dynamic delay modulation while incurring only minimal additional cost. Experiments on speech benchmarks, including the Spiking Heidelberg Dataset, Spiking Speech Commands, and Google Speech Commands, demonstrate that introducing congestion-aware delays into synaptic signal transmission effectively improves accuracy on temporal tasks, notably achieving 93.75% accuracy on SHD, 80.69% accuracy on SSC, and 95.58% on GSC-35, while reducing the parameter count by approximately 50% compared to state-of-the-art delay-based methods with the same architecture.
♻ ☆ Generalizing the Geometry of Model Merging Through Frechet Averages
Model merging aims to combine multiple models into one without additional training. Naïve parameter-space averaging can be fragile under architectural symmetries, as their geometry does not take them into account. In this work we show that not only the geometry, but also the averaging procedure itself, must be symmetry-invariant to achieve symmetry-aware merges. Consequently, we propose a general solution: merging as Fréchet averaging, i.e., selecting parameters that minimize a sum of geodesic distances on an appropriate manifold. In this view, the key design choice is the overall geometry, i.e., the choice of metric, manifold, and distance approximation, that determines what it means for two models to be "close". We show that Fréchet averaging, combined with simplifying assumptions, contains Fisher merging. Building on this, we examine the particular case of low-rank adapters (LoRA), whose symmetries induce a distinct geometry: that of a quotient manifold. We outline the limitations of current LoRA merging methods, propose a practical algorithm for this setting, and show how they compare with other commonly used approaches.
♻ ☆ High entropy leads to symmetry equivariant policies in Dec-POMDPs
We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that the policy gradient flow with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different initializations will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive evaluation of independent PPO, arguably the standard baseline deep multi-agent policy gradient algorithm, in the Hanabi, Overcooked and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the decrease in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi in particular we achieve a new SOTA in inter-seed cross-play this way. While we give examples of Dec-POMDPs in which one cannot learn the optimal symmetry equivariant policy this way, both our theoretical and empirical results suggest that one should consider far higher entropy coefficients during hyperparameter sweeps in Dec-POMDPs than is typically done.
♻ ☆ Time series saliency maps: explaining models across multiple domains
Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model's output. However, in time series, they offer limited insights, as semantically meaningful features are often found in other domains. We introduce Cross-domain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions in any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. We validate our method via controlled experiments with mechanistic analysis, quantitative faithfulness tests, and real-world case studies. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture in three real-world tasks across a variety of model architectures, machine-learning tasks, and cross-domain transforms: frequency-based attribution for a regression task in wearable heart rate extraction, independent component analysis in a classification task for electroencephalography-based seizure detection, and seasonal-trend decomposition for a forecasting problem with a zero-shot time-series foundation model. We release an open-source TensorFlow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of Cross-Domain Integrated Gradients to provide semantically meaningful insights into time-series models that are impossible to achieve with traditional saliency in the time domain.
♻ ☆ Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search
Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
♻ ☆ Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts analogously to a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
♻ ☆ Mono-Forward: Revisiting Forward-Forward through Objective-Locality Decomposition
Backpropagation remains the dominant algorithm for training deep neural networks, but it incurs substantial memory overhead and relies on global error propagation, which is often regarded as biologically implausible. The Forward-Forward (FF) algorithm is an appealing local-learning alternative to backpropagation, yet it still lags behind backpropagation in accuracy. A central unresolved question is whether this gap arises from FF's locality or from the positive-negative double-pass goodness objective used to train each layer. In this work, we revisit FF under the supervised setting through a decomposition that separates these two design choices. Our analysis suggests that FF's performance limitations are not explained by locality alone, but are also likely influenced by its goodness objective. Motivated by this view, we introduce Mono-Forward (MF), a simplification of FF that preserves its locality while replacing the contrastive goodness objective with a standard multi-class cross-entropy objective applied locally at each layer, serving as a controlled baseline for evaluating local learning under a standard classification objective. Across MLPs and convolutional networks, MF outperforms vanilla FF and remains competitive in multiple FF variants. On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while requiring only 31% of backpropagation's memory.
comment: 26 pages
♻ ☆ Revenue Maximization Under Sequential Price Competition Via The Estimation Of s-Concave Demand Functions
We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices (which are made public) and subsequently observe their respective demand (not made public). The demand function of each seller depends on all sellers' prices through a private, unknown, and nonlinear relationship. We propose a dynamic pricing policy that uses semi-parametric least-squares estimation and show that when the sellers employ our policy, their prices converge at a rate of $O(T^{-1/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Each seller incurs a regret of $O(T^{5/7})$ relative to a dynamic benchmark policy. A theoretical contribution of our work is proving the existence of equilibrium under shape-constrained demand functions via the concept of $s$-concavity and establishing regret bounds of our proposed policy. Technically, we also establish new concentration results for the least squares estimator under shape constraints. Our findings offer significant insights into dynamic competition-aware pricing and contribute to the broader study of non-parametric learning in strategic decision-making.
♻ ☆ CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual applications. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing LLMs can restate definitions but fail concept-linked quizzes, quantifying the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns unguided with concept-primed policies, or standard GRPO directly on concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes and concept-injected rollouts under outcome regularization. It provides fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning, while remaining algorithm- and verifier-agnostic.
♻ ☆ MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents NeurIPS 2026
Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $Ω(1/ρ^2)$ calibration samples and MEMSAD achieves this up to $\log(1/δ)$ factors. We further derive online regret bounds for rolling calibration at rate $O(σ^{2/3}Δ^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $Δ$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.
comment: 28 pages, 9 figures, 6 theorems. Submitted to NeurIPS 2026
♻ ☆ Model Compression with Exact Budget Constraints via Riemannian Manifolds
Assigning one of K options to each of N groups under a total cost budget is a recurring problem in efficient AI, including mixed-precision quantization, non-uniform pruning, and expert selection. The objective, typically model loss, depends jointly on all assignments and does not decompose across groups, preventing combinatorial solvers from directly optimizing the true objective and forcing reliance on proxy formulations. Methods such as evolutionary search evaluate the actual loss but lack gradient information, while penalty-based approaches enforce the budget only approximately and often require extensive hyperparameter tuning. We present a new approach by showing that, under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space with unusually simple geometry. The normal vector admits a closed-form expression, shifting logits along the cost vector changes expected cost monotonically, and vector transport reduces to a single inner product. Building on these properties, we propose Riemannian Constrained Optimization (RCO), which augments a standard Adam step with tangent projection, binary-search retraction, and momentum transport. Combined with Gumbel straight-through estimation and budget-constrained dynamic programming for discrete feasibility, RCO enables first-order optimization of the actual loss under exact budget enforcement without introducing constraint-specific hyperparameters. Across both synthetic benchmarks and realistic LLM compression settings, RCO matches or exceeds state-of-the-art methods while often requiring substantially less wall-clock time. Source code is available at https://github.com/IST-DASLab/RCO.
♻ ☆ Decentralized Attention Fails Centralized Signals: Rethinking Transformers for Medical Time Series ICLR 2026
Accurate analysis of medical time series (MedTS) data, such as electroencephalography (EEG) and electrocardiography (ECG), plays a pivotal role in healthcare applications, including the diagnosis of brain and heart diseases. MedTS data typically exhibit two critical patterns: temporal dependencies within individual channels and channel dependencies across multiple channels. While recent advances in deep learning have leveraged Transformer-based models to effectively capture temporal dependencies, they often struggle with modeling channel dependencies. This limitation stems from a structural mismatch: MedTS signals are inherently centralized, whereas the Transformer's attention mechanism is decentralized, making it less effective at capturing global synchronization and unified waveform patterns. To address this mismatch, we propose CoTAR (Core Token Aggregation-Redistribution), a centralized MLP-based module designed to replace decentralized attention. Instead of allowing all tokens to interact directly, as in standard attention, CoTAR introduces a global core token that serves as a proxy to facilitate inter-token interactions, thereby enforcing a centralized aggregation and redistribution strategy. This design not only better aligns with the centralized nature of MedTS signals but also reduces computational complexity from quadratic to linear. Experiments on five benchmarks validate the superiority of our method in both effectiveness and efficiency, achieving up to a 11.6% improvement on the APAVA dataset, while using only 33% of the memory and 20% of the inference time compared to the previous state of the art. Code and all training scripts are available at https://github.com/Levi-Ackman/TeCh.
comment: Accepted by ICLR 2026 (Oral). arXiv admin note: text overlap with arXiv:2405.19363 by other authors
♻ ☆ Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography
Accurate delineation of kidney tumours in Computed Tomography (CT) is essential for downstream quantitative analysis and precision oncology, but manual segmentation is a specialised task, time-consuming and difficult to scale. Automated 3D segmentation remains challenging because CT scans are large volumetric images, making high-resolution dense convolutional networks computationally expensive and often dependent on downsampling or patch-based inference. We propose a two-stage 3D segmentation methodology based on voxel sparsification and submanifold sparse convolutional networks (SSCNs). Stage 1 uses a low-resolution sparse network to identify a region of interest (ROI); Stage 2 applies a high-resolution sparse network for refined segmentation within the cropped ROI. This enables native high-resolution 3D processing while reducing memory use and inference time. We evaluate the method on the KiTS23 renal cancer CT dataset using 5-fold cross-validation. Our method achieved Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone, competitive with top KiTS23 approaches. In direct comparisons on the same cross-validation folds, the proposed sparse method achieves tumour + cyst and tumour-only Dice scores comparable to, and slightly higher than, a patch-based nnU-Net baseline, while consistently requiring less VRAM and shorter inference time across the tested hardware. Across the tested GPUs, our sparse model is markedly faster than both nnU-Net and the zero-shot zoom-out/zoom-in foundation model SegVol, which localises kidneys well but underperforms on small heterogeneous lesions. Compared to an equivalent dense implementation of the same architecture, the proposed sparse approach achieves up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage across both CPU and the GPU configurations tested.
comment: 15 pages, 6 figures
♻ ☆ Autolearn: Learn by Surprise, Commit by Proof
We propose Autolearn, a framework that enables language models to learn from documents they read, with no external supervision. Passages that produce anomalously high per-token loss are flagged, verified through a self-generated Q&A chain, and trained on with conviction-proportional $β_2$ adjustment. We introduce the perturbation gap (paraphrase-to-original perplexity ratio) as a metric that distinguishes memorization from understanding. The key mechanism is the training data format: Q&A-format training drives the perturbation gap below the pre-trained baseline (2.098 vs. 2.204, $Δ= -0.106$, $> 10σ$), suppressing token-sequence memorization, while standard fine-tuning's best attempt remains within noise ($Δ= -0.010$, $< 1σ$). Across four models spanning Qwen3 and Phi-4 families, Autolearn is the only method that enters this regime. Stochastic evaluation reveals passage-specific knowledge acquisition: the probability of generating a correct novel fact rises from 6% to 54% after training ($p < 10^{-4}$), and Q&A format outperforms standard fine-tuning on genuinely novel facts. The system is self-extinguishing: learned content reduces surprisal below threshold and is skipped on re-encounter.
comment: 21 pages, 2 figures
♻ ☆ InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
We study how large language models can be used to evolve inventory policies in online, non-stationary environments. Our work is motivated by recent advances in LLM-based evolutionary search, such as AlphaEvolve, which demonstrates strong performance for static and highly structured problems such as mathematical discovery, but is not directly suited to online dynamic inventory settings. To this end, we propose InvEvolve, an end-to-end inventory-policy evolution and inference framework grounded in confidence-interval-based certification. The framework trains a large language model using reinforcement learning, incorporates demand data as well as numerical and textual features beyond demand, and generates white-box inventory policy with statistical safety guarantees for deployment in future periods. We further introduce a unified theoretical interface that connects training, inference, and deployment. This allows us to characterize the probability lower bound that the InvEvolve evolves a statistically safe and improved policy, and to quantify the multi-period performance gap relative to the oracle-safe benchmark. Tested on both synthetic data and real-world retail data, InvEvolve outperforms classical inventory policies and deep learning based methods. In canonical inventory settings, it evolves new policies that improve upon existing benchmarks.
♻ ☆ Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model
Reinforcement learning (RL) has become a prevalent paradigm for training tool calling agents, which typically requires online interactive environments. Existing approaches either rely on training data with ground truth annotations or require advanced proprietary language models (LMs) to synthesize environments that keep fixed once created. In this work, we propose TRUSTEE, a cost-friendly method for training tool calling agents with dynamic environments fully simulated by free open-source LMs that can be as small as 8B, including task generation, user simulation, tool simulation and trajectory evaluation, paired with an adaptive curriculum learning mechanism that controls task difficulty during training. Our empirical results show that TRUSTEE outperforms baselines which require extra external resources in most cases. These confirm that, with a sufficiently sophisticated design, even simulated environments with a local 8B LM as the backbone could set a strong baseline for tool learning. We hope our proposed paradigm could democratize tool learning and inspire future research on environment scaling with limited resources.
comment: Preprint
♻ ☆ Optimal In-context Adaptivity and Distributional Robustness of Transformers
We study in-context learning problems where a Transformer is pretrained on tasks drawn from a mixture distribution $π=\sum_{α\in\mathcal{A}} λ_α π_α$, called the pretraining prior, in which each mixture component $π_α$ is a distribution on tasks of a specific difficulty level indexed by $α$. Our goal is to understand the performance of the pretrained Transformer when evaluated on a different test distribution $μ$, consisting of tasks of fixed difficulty $β\in\mathcal{A}$, and with potential distribution shift relative to $π_β$, subject to the chi-squared divergence $χ^2(μ,π_β)$ being at most $κ$. In particular, we consider nonparametric regression problems with random smoothness, and multi-index models with both random smoothness and random effective dimension. We prove that a large Transformer pretrained on sufficient data achieves the optimal rate of convergence corresponding to the difficulty level $β$, uniformly over test distributions $μ$ in the chi-squared divergence ball. Thus, the pretrained Transformer is able to achieve faster rates of convergence on easier tasks and is robust to distribution shift at test time. Finally, we prove that even if an estimator had access to the test distribution $μ$, the convergence rate of its expected risk over $μ$ could not be faster than that of our pretrained Transformers, thereby providing a more appropriate optimality guarantee than minimax lower bounds.
comment: 47 pages, 4 figures
♻ ☆ Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training
LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose \textbf{LoPT}: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT
comment: 33pages
♻ ☆ Continual Knowledge Updating in LLM Systems: Learning Through Multi-Timescale Memory Dynamics
LLMs are trained once, then deployed into a world that never stops changing. External memory compensates for this, but most systems manage it explicitly rather than letting it adapt on its own. Biological memory works differently: coupled multi-timescale dynamics make new associations immediately usable, strengthen what repetition confirms, and let the rest fade. We argue that external memory should follow a similar principle. In Memini, this view takes the form of an associative memory that organizes knowledge as a directed graph. Each edge carries two coupled internal variables, one fast and one slow, following the Benna-Fusi model of synaptic consolidation. From this coupling, episodic sensitivity, gradual consolidation, and selective forgetting emerge as facets of a single mechanism, reframing external memory as a learning substrate that reorganizes through its own dynamics.
comment: Preprint. 9 pages, 2 figures
♻ ☆ Teaching Metric Distance to Discrete Autoregressive Language Models
Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
♻ ☆ DisRFM: Polar Riemannian Flow Matching for Structure-Preserving Graph Domain Adaptation
Graph Domain Adaptation (GDA) aims to transfer graph classifiers across domains with both semantic and topological shifts. Existing Euclidean adversarial methods face two challenges: Structural Degeneration, where domain confusion entangles and suppresses label-relevant topology, and Optimization Instability, where minimax training induces oscillatory gradients under large structural shifts. We propose DisRFM, a geometry-aware GDA framework that addresses these challenges with Riemannian representation learning and flow-based transport. DisRFM embeds graph representations on a constant-curvature manifold and expresses them in geodesic polar coordinates. Polar endpoint regularization calibrates topologysensitive radial scales via univariate Wasserstein alignment and preserves scalenormalized class semantics through confidence-filtered angular alignment, with radial magnitude modulating pseudo-label reliability. DisRFM introduces topologyconditioned polar flow matching, which couples class-compatible source and target samples by a normalized polar transport cost and learns a metric-corrected vector field along geodesic interpolants. Theoretical analysis characterizes the structural risk of unconditional domain confusion and relates polar discrepancies and flow error to target risk. Extensive experiments under diverse domain shifts demonstrate that DisRFM consistently outperforms state-of-the-art methods.
Multimedia 8
☆ Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7, 402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
comment: Code: https://github.com/lihongzhao99/MMDG_Benchmark
☆ LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compression schemes based on scalar quantization or resolution reduction are broadly applicable but fail to exploit inherent signal redundancies, resulting in suboptimal rate-distortion performance. Recent generative neural codecs, or tokenizers, model complex signal dependencies but are often over-parameterized, data-hungry, and modality-specific, making them impractical for resource-constrained environments. We introduce a Lightweight, Versatile, and Asymmetric neural codec architecture (LiVeAction), that addresses these limitations through two key ideas. (1) To reduce the complexity of the encoder to meet the resource constraints of the execution environments, we impose an FFT-like structure and reduce the overall size and depth of the neural-network-based analysis transform. (2) To allow arbitrary signal modalities and simplify training, we replace adversarial and perceptual losses with a variance-based rate penalty. Our design produces codecs that deliver superior rate-distortion performance compared to state-of-the-art generative tokenizers, while remaining practical for deployment on low-power sensors. We release our code, experiments, and python library at https://github.com/UT-SysML/liveaction .
comment: DCC 2026
☆ Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which approaches MER from the perspective of representation consistency, aiming to enable robust emotion prediction across heterogeneous modality combinations. MCUR incorporates two core components: (1) Modality Combination-Based and Category-Based Contrastive Learning mechanism (MCB-CL), which encourages samples with the same emotion category and the same available modalities to be close in the representation space; and (2) Sample-wise Uncertainty-Guided Regularization (SUGR), which adaptively assigns sample-wise uncertain weights to samples to optimize training. Extensive experiments demonstrate that MCUR consistently outperforms existing methods, achieving average F1 gains of 2.2% on MOSI, 2.67% on MOSEI, and 4.37% on IEMOCAP.
comment: 24 pages, 6 figures and 16 tables
☆ Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval ICML 2026
Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
comment: Accepted by ICML 2026. 16 pages, 6 figures, 3 tables
☆ Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling
Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.
☆ MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.
comment: Project Page: https://billyzhang24kobe.github.io/mist-smarthome/
♻ ☆ Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications IEEE
Semantic communications mark a paradigm shift from bit-accurate transmission toward meaning-centric communication, essential as wireless systems approach theoretical capacity limits. The emergence of generative AI has catalyzed generative semantic communications, where receivers reconstruct content from minimal semantic cues by leveraging learned priors. Among generative approaches, diffusion models stand out for their superior generation quality, stable training dynamics, and rigorous theoretical foundations. However, the field currently lacks systematic guidance connecting diffusion techniques to communication system design, forcing researchers to navigate disparate literatures. This article provides the first comprehensive tutorial on diffusion models for generative semantic communications. We present score-based diffusion foundations and systematically review three technical pillars: conditional diffusion for controllable generation, efficient diffusion for accelerated inference, and generalized diffusion for cross-domain adaptation. In addition, we introduce an inverse problem perspective that reformulates semantic decoding as posterior inference, bridging semantic communications with computational imaging. Through analysis of human-centric, machine-centric, and agent-centric scenarios, we illustrate how diffusion models enable extreme compression while maintaining semantic fidelity and robustness. By bridging generative AI innovations with communication system design, this article aims to establish diffusion models as foundational components of next-generation wireless networks and beyond.
comment: Accepted by IEEE COMST, GitHub repository: https://github.com/qin-jingyun/Awesome-DiffComm, project page: https://qin-jingyun.github.io/Awesome-DiffComm
♻ ☆ MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching
3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression methods still rely on many coupled hyperparameters across pruning, transformation, quantization, and entropy coding, making it difficult to control the final compressed size and fully exploit the rate-distortion trade-off. We propose MesonGS++, a size-aware post-training codec for 3D Gaussian compression. On the codec side, MesonGS++ combines joint importance-based pruning, octree geometry coding, attribute transformation, selective vector quantization for higher-degree spherical harmonics, and group-wise mixed-precision quantization with entropy coding. On the configuration side, it treats the reserve ratio and bit-width allocation as the dominant rate-distortion knobs and jointly optimizes them under a target storage budget via discrete sampling and 0--1 integer linear programming. We further propose a linear size estimator and a CUDA parallel quantization operator to accelerate the hyperparameter searching process. Extensive experiments show that MesonGS++ achieves over 34$\times$ compression while preserving rendering fidelity, outperforming state-of-the-art post-training methods and accurately meeting target size budgets. Remarkably, without any training, MesonGS++ can even surpass the PSNR of vanilla 3DGS at a 20$\times$ compression rate on the Stump scene. Our code is available at https://github.com/mmlab-sigs/mesongs_plus
comment: https://github.com/mmlab-sigs/mesongs_plus